Warning: Permanently added '54.90.80.36' (ED25519) to the list of known hosts. You can reproduce this build on your computer by running: sudo dnf install copr-rpmbuild /usr/bin/copr-rpmbuild --verbose --drop-resultdir --task-url https://copr.fedorainfracloud.org/backend/get-build-task/7325889-epel-8-x86_64 --chroot epel-8-x86_64 Version: 0.72 PID: 6603 Logging PID: 6604 Task: {'allow_user_ssh': False, 'appstream': False, 'background': False, 'build_id': 7325889, 'buildroot_pkgs': [], 'chroot': 'epel-8-x86_64', 'enable_net': True, 'fedora_review': False, 'git_hash': 'fe22476c3c0b61ebd0a9858693b287e4007599c0', 'git_repo': 'https://copr-dist-git.fedorainfracloud.org/git/rezso/ML/cutlass', 'isolation': 'default', 'memory_reqs': 2048, 'package_name': 'cutlass', 'package_version': '3.5.0-20240411.1.cu12_4', 'project_dirname': 'ML', 'project_name': 'ML', 'project_owner': 'rezso', 'repo_priority': None, 'repos': [{'baseurl': 'https://download.copr.fedorainfracloud.org/results/rezso/ML/epel-8-x86_64/', 'id': 'copr_base', 'name': 'Copr repository', 'priority': None}, {'baseurl': 'https://download.copr.fedorainfracloud.org/results/rezso/CUDA/epel-8-x86_64/', 'id': 'copr_rezso_CUDA', 'name': 'Additional repo copr_rezso_CUDA'}, {'baseurl': 'http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64', 'id': 'http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64', 'name': 'Additional repo http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64'}, {'baseurl': 'http://developer.download.nvidia.com/compute/cuda/repos/rhel8/sbsa', 'id': 'http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa', 'name': 'Additional repo http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa'}, {'baseurl': 'http://developer.download.nvidia.com/compute/cuda/repos/rhel8/ppc64le', 'id': 'http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le', 'name': 'Additional repo http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le'}], 
'sandbox': 'rezso/ML--rezso', 'source_json': {}, 'source_type': None, 'ssh_public_keys': None, 'submitter': 'rezso', 'tags': [], 'task_id': '7325889-epel-8-x86_64', 'timeout': 172800, 'uses_devel_repo': False, 'with_opts': [], 'without_opts': []} Running: git clone https://copr-dist-git.fedorainfracloud.org/git/rezso/ML/cutlass /var/lib/copr-rpmbuild/workspace/workdir-hd8xdoow/cutlass --depth 500 --no-single-branch --recursive cmd: ['git', 'clone', 'https://copr-dist-git.fedorainfracloud.org/git/rezso/ML/cutlass', '/var/lib/copr-rpmbuild/workspace/workdir-hd8xdoow/cutlass', '--depth', '500', '--no-single-branch', '--recursive'] cwd: . rc: 0 stdout: stderr: Cloning into '/var/lib/copr-rpmbuild/workspace/workdir-hd8xdoow/cutlass'... Running: git checkout fe22476c3c0b61ebd0a9858693b287e4007599c0 -- cmd: ['git', 'checkout', 'fe22476c3c0b61ebd0a9858693b287e4007599c0', '--'] cwd: /var/lib/copr-rpmbuild/workspace/workdir-hd8xdoow/cutlass rc: 0 stdout: stderr: Note: switching to 'fe22476c3c0b61ebd0a9858693b287e4007599c0'. You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by switching back to a branch. If you want to create a new branch to retain commits you create, you may do so (now or later) by using -c with the switch command. 
Example: git switch -c Or undo this operation with: git switch - Turn off this advice by setting config variable advice.detachedHead to false HEAD is now at fe22476 automatic import of cutlass Running: copr-distgit-client sources cmd: ['copr-distgit-client', 'sources'] cwd: /var/lib/copr-rpmbuild/workspace/workdir-hd8xdoow/cutlass rc: 0 stdout: stderr: INFO: Reading stdout from command: git rev-parse --abbrev-ref HEAD INFO: Reading stdout from command: git rev-parse HEAD INFO: Reading sources specification file: sources /usr/bin/tail: /var/lib/copr-rpmbuild/main.log: file truncated Running (timeout=172800): unbuffer mock --spec /var/lib/copr-rpmbuild/workspace/workdir-hd8xdoow/cutlass/cutlass.spec --sources /var/lib/copr-rpmbuild/workspace/workdir-hd8xdoow/cutlass --resultdir /var/lib/copr-rpmbuild/results --uniqueext 1713469181.334935 -r /var/lib/copr-rpmbuild/results/configs/child.cfg INFO: mock.py version 5.5 starting (python version = 3.12.1, NVR = mock-5.5-1.fc39), args: /usr/libexec/mock/mock --spec /var/lib/copr-rpmbuild/workspace/workdir-hd8xdoow/cutlass/cutlass.spec --sources /var/lib/copr-rpmbuild/workspace/workdir-hd8xdoow/cutlass --resultdir /var/lib/copr-rpmbuild/results --uniqueext 1713469181.334935 -r /var/lib/copr-rpmbuild/results/configs/child.cfg Start(bootstrap): init plugins INFO: tmpfs initialized INFO: selinux enabled INFO: chroot_scan: initialized INFO: compress_logs: initialized Finish(bootstrap): init plugins Start: init plugins INFO: tmpfs initialized INFO: selinux enabled INFO: chroot_scan: initialized INFO: compress_logs: initialized Finish: init plugins INFO: Signal handler active Start: run INFO: Start(/var/lib/copr-rpmbuild/workspace/workdir-hd8xdoow/cutlass/cutlass.spec) Config(rhel+epel-8-x86_64) Start: clean chroot Finish: clean chroot Mock Version: 5.5 INFO: Mock Version: 5.5 Start(bootstrap): chroot init INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-x86_64-bootstrap-1713469181.334935/root. 
INFO: calling preinit hooks INFO: enabled root cache INFO: enabled package manager cache Start(bootstrap): cleaning package manager metadata Finish(bootstrap): cleaning package manager metadata INFO: Guessed host environment type: unknown INFO: Using bootstrap image: registry.access.redhat.com/ubi8/ubi INFO: Pulling image: registry.access.redhat.com/ubi8/ubi INFO: Copy content of container registry.access.redhat.com/ubi8/ubi to /var/lib/mock/rhel+epel-8-x86_64-bootstrap-1713469181.334935/root INFO: Checking that registry.access.redhat.com/ubi8/ubi image matches host's architecture INFO: mounting registry.access.redhat.com/ubi8/ubi with podman image mount INFO: image registry.access.redhat.com/ubi8/ubi as /var/lib/containers/storage/overlay/31ef0364e9a5089fff79d6ab4a2ccac8398c4aadd2d838b72e7f5fe1b77a4562/merged INFO: umounting image registry.access.redhat.com/ubi8/ubi (/var/lib/containers/storage/overlay/31ef0364e9a5089fff79d6ab4a2ccac8398c4aadd2d838b72e7f5fe1b77a4562/merged) with podman image umount INFO: Package manager dnf detected and used (fallback) INFO: Not updating bootstrap chroot, bootstrap_image_ready=True Start(bootstrap): creating root cache Finish(bootstrap): creating root cache Finish(bootstrap): chroot init Start: chroot init INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-x86_64-1713469181.334935/root. INFO: calling preinit hooks INFO: enabled root cache INFO: enabled package manager cache Start: cleaning package manager metadata Finish: cleaning package manager metadata INFO: enabled HW Info plugin INFO: Package manager dnf detected and used (direct choice) INFO: Buildroot is handled by package management downloaded with a bootstrap image: rpm-4.14.3-28.el8_9.x86_64 python3-dnf-4.7.0-19.el8.noarch python3-dnf-plugins-core-4.0.21-23.el8.noarch yum-4.7.0-19.el8.noarch Start: installing minimal buildroot with dnf No matches found for the following disable plugin patterns: local, spacewalk, versionlock Updating Subscription Management repositories. 
Unable to read consumer identity This system is not registered with an entitlement server. You can use subscription-manager to register. Copr repository 17 MB/s | 1.0 MB 00:00 Additional repo copr_rezso_CUDA 2.2 MB/s | 71 kB 00:00 Additional repo http_developer_download_nvidia_ 173 MB/s | 3.3 MB 00:00 Additional repo http_developer_download_nvidia_ 118 MB/s | 2.0 MB 00:00 Additional repo http_developer_download_nvidia_ 120 MB/s | 1.8 MB 00:00 Red Hat Enterprise Linux - BaseOS 125 MB/s | 67 MB 00:00 Red Hat Enterprise Linux - AppStream 117 MB/s | 60 MB 00:00 Red Hat Enterprise Linux - CodeReady Linux Buil 18 MB/s | 9.2 MB 00:00 Extra Packages for Enterprise Linux 8 - x86_64 108 MB/s | 16 MB 00:00 Modular dependency problems: Problem 1: nothing provides requested module(nvidia-driver:latest-dkms:20240416084055) Problem 2: nothing provides requested module(nvidia-driver:latest-dkms:20240416084208) Dependencies resolved. =========================================================================================== Package Arch Version Repository Size =========================================================================================== Installing: bash x86_64 4.4.20-4.el8_6 rhel-baseos 1.5 M bzip2 x86_64 1.0.6-26.el8 rhel-baseos 60 k coreutils x86_64 8.30-15.el8 rhel-baseos 1.2 M cpio x86_64 2.12-11.el8 rhel-baseos 266 k diffutils x86_64 3.6-6.el8 rhel-baseos 359 k epel-rpm-macros noarch 8-41 epel 27 k findutils x86_64 1:4.6.0-21.el8 rhel-baseos 527 k gawk x86_64 4.2.1-4.el8 rhel-baseos 1.1 M gcc x86_64 8.5.0-20.el8 rhel-appstream 23 M gcc-c++ x86_64 8.5.0-20.el8 rhel-appstream 12 M grep x86_64 3.1-6.el8 rhel-baseos 274 k gzip x86_64 1.9-13.el8_5 rhel-baseos 167 k info x86_64 6.5-7.el8 rhel-baseos 198 k make x86_64 1:4.2.1-11.el8 rhel-baseos 498 k patch x86_64 2.7.6-11.el8 rhel-baseos 138 k redhat-release x86_64 8.9-0.1.el8 rhel-baseos 45 k redhat-rpm-config noarch 131-1.el8 rhel-appstream 91 k rpm-build x86_64 4.14.3-28.el8_9 rhel-appstream 174 k sed x86_64 
4.5-5.el8 rhel-baseos 298 k tar x86_64 2:1.30-9.el8 rhel-baseos 839 k unzip x86_64 6.0-46.el8 rhel-baseos 196 k util-linux x86_64 2.32.1-44.el8_9.1 rhel-baseos 2.5 M which x86_64 2.21-20.el8 rhel-baseos 50 k xz x86_64 5.2.4-4.el8_6 rhel-baseos 153 k Installing dependencies: annobin x86_64 11.13-2.el8 rhel-appstream 972 k ansible-srpm-macros noarch 1-12.el8 epel 21 k audit-libs x86_64 3.0.7-5.el8 rhel-baseos 123 k basesystem noarch 11-5.el8 rhel-baseos 11 k binutils x86_64 2.30-123.el8 rhel-baseos 5.8 M brotli x86_64 1.0.6-3.el8 rhel-baseos 323 k bzip2-libs x86_64 1.0.6-26.el8 rhel-baseos 48 k ca-certificates noarch 2023.2.60_v7.0.306-80.0.el8_8 rhel-baseos 935 k chkconfig x86_64 1.19.2-1.el8 rhel-baseos 199 k coreutils-common x86_64 8.30-15.el8 rhel-baseos 2.0 M cpp x86_64 8.5.0-20.el8 rhel-appstream 10 M cracklib x86_64 2.9.6-15.el8 rhel-baseos 93 k cracklib-dicts x86_64 2.9.6-15.el8 rhel-baseos 4.0 M crypto-policies noarch 20230731-1.git3177e06.el8 rhel-baseos 64 k curl x86_64 7.61.1-33.el8_9.5 rhel-baseos 354 k cyrus-sasl-lib x86_64 2.1.27-6.el8_5 rhel-baseos 123 k dwz x86_64 0.12-10.el8 rhel-appstream 109 k efi-srpm-macros noarch 3-3.el8 rhel-appstream 22 k elfutils x86_64 0.189-3.el8 rhel-baseos 553 k elfutils-default-yama-scope noarch 0.189-3.el8 rhel-baseos 52 k elfutils-libelf x86_64 0.189-3.el8 rhel-baseos 232 k elfutils-libs x86_64 0.189-3.el8 rhel-baseos 303 k expat x86_64 2.2.5-11.el8_9.1 rhel-baseos 114 k file x86_64 5.33-25.el8 rhel-baseos 77 k file-libs x86_64 5.33-25.el8 rhel-baseos 544 k filesystem x86_64 3.8-6.el8 rhel-baseos 1.1 M fpc-srpm-macros noarch 1.3-1.el8 epel 8.2 k gc x86_64 7.6.4-3.el8 rhel-appstream 109 k gcc-plugin-annobin x86_64 8.5.0-20.el8 rhel-appstream 36 k gdb-headless x86_64 8.2-20.el8 rhel-appstream 3.7 M gdbm x86_64 1:1.18-2.el8 rhel-baseos 130 k gdbm-libs x86_64 1:1.18-2.el8 rhel-baseos 60 k ghc-srpm-macros noarch 1.4.2-7.el8 rhel-appstream 9.4 k glib2 x86_64 2.56.4-161.el8 rhel-baseos 2.5 M glibc x86_64 2.28-236.el8_9.12 
rhel-baseos 2.2 M glibc-all-langpacks x86_64 2.28-236.el8_9.12 rhel-baseos 26 M glibc-common x86_64 2.28-236.el8_9.12 rhel-baseos 1.0 M glibc-devel x86_64 2.28-236.el8_9.12 rhel-baseos 86 k glibc-gconv-extra x86_64 2.28-236.el8_9.12 rhel-baseos 1.6 M glibc-headers x86_64 2.28-236.el8_9.12 rhel-baseos 491 k gmp x86_64 1:6.1.2-10.el8 rhel-baseos 321 k gnupg2 x86_64 2.2.20-3.el8_6 rhel-baseos 2.4 M gnutls x86_64 3.6.16-8.el8_9.3 rhel-baseos 1.0 M go-srpm-macros noarch 2-17.el8 rhel-appstream 13 k guile x86_64 5:2.0.14-7.el8 rhel-appstream 3.5 M ima-evm-utils x86_64 1.3.2-12.el8 rhel-baseos 64 k isl x86_64 0.16.1-6.el8 rhel-appstream 841 k kernel-headers x86_64 4.18.0-513.24.1.el8_9 rhel-baseos 11 M keyutils-libs x86_64 1.5.10-9.el8 rhel-baseos 34 k krb5-libs x86_64 1.18.2-26.el8_9 rhel-baseos 842 k libacl x86_64 2.2.53-1.el8 rhel-baseos 35 k libarchive x86_64 3.3.3-5.el8 rhel-baseos 360 k libassuan x86_64 2.5.1-3.el8 rhel-baseos 83 k libatomic_ops x86_64 7.6.2-3.el8 rhel-appstream 38 k libattr x86_64 2.4.48-3.el8 rhel-baseos 27 k libbabeltrace x86_64 1.5.4-4.el8 rhel-baseos 200 k libblkid x86_64 2.32.1-44.el8_9.1 rhel-baseos 221 k libcap x86_64 2.48-6.el8_9 rhel-baseos 74 k libcap-ng x86_64 0.7.11-1.el8 rhel-baseos 33 k libcom_err x86_64 1.45.6-5.el8 rhel-baseos 49 k libcurl x86_64 7.61.1-33.el8_9.5 rhel-baseos 304 k libdb x86_64 5.3.28-42.el8_4 rhel-baseos 751 k libdb-utils x86_64 5.3.28-42.el8_4 rhel-baseos 150 k libfdisk x86_64 2.32.1-44.el8_9.1 rhel-baseos 254 k libffi x86_64 3.1-24.el8 rhel-baseos 38 k libgcc x86_64 8.5.0-20.el8 rhel-baseos 81 k libgcrypt x86_64 1.8.5-7.el8_6 rhel-baseos 463 k libgomp x86_64 8.5.0-20.el8 rhel-baseos 208 k libgpg-error x86_64 1.31-1.el8 rhel-baseos 242 k libidn2 x86_64 2.2.0-1.el8 rhel-baseos 94 k libipt x86_64 1.6.1-8.el8 rhel-appstream 50 k libksba x86_64 1.3.5-9.el8_7 rhel-baseos 134 k libmount x86_64 2.32.1-44.el8_9.1 rhel-baseos 237 k libmpc x86_64 1.1.0-9.1.el8 rhel-appstream 61 k libnghttp2 x86_64 1.33.0-5.el8_9 rhel-baseos 
78 k libnsl2 x86_64 1.2.0-2.20180605git4a062cf.el8 rhel-baseos 58 k libpkgconf x86_64 1.4.2-1.el8 rhel-baseos 35 k libpsl x86_64 0.20.2-6.el8 rhel-baseos 61 k libpwquality x86_64 1.4.4-6.el8 rhel-baseos 107 k libselinux x86_64 2.9-8.el8 rhel-baseos 166 k libsemanage x86_64 2.9-9.el8_6 rhel-baseos 168 k libsepol x86_64 2.9-3.el8 rhel-baseos 340 k libsigsegv x86_64 2.11-5.el8 rhel-baseos 30 k libsmartcols x86_64 2.32.1-44.el8_9.1 rhel-baseos 180 k libssh x86_64 0.9.6-13.el8_9 rhel-baseos 220 k libssh-config noarch 0.9.6-13.el8_9 rhel-baseos 21 k libstdc++ x86_64 8.5.0-20.el8 rhel-baseos 455 k libstdc++-devel x86_64 8.5.0-20.el8 rhel-appstream 2.1 M libtasn1 x86_64 4.13-4.el8_7 rhel-baseos 76 k libtirpc x86_64 1.1.4-8.el8 rhel-baseos 113 k libtool-ltdl x86_64 2.4.6-25.el8 rhel-baseos 58 k libunistring x86_64 0.9.9-3.el8 rhel-baseos 422 k libusbx x86_64 1.0.23-4.el8 rhel-baseos 74 k libutempter x86_64 1.1.6-14.el8 rhel-baseos 32 k libuuid x86_64 2.32.1-44.el8_9.1 rhel-baseos 99 k libverto x86_64 0.3.2-2.el8 rhel-baseos 24 k libxcrypt x86_64 4.1.1-6.el8 rhel-baseos 73 k libxcrypt-devel x86_64 4.1.1-6.el8 rhel-baseos 25 k libxml2 x86_64 2.9.7-18.el8_9 rhel-baseos 697 k libzstd x86_64 1.4.4-1.el8 rhel-baseos 266 k lua-libs x86_64 5.3.4-12.el8 rhel-baseos 118 k lua-srpm-macros noarch 1-13.el8 epel 9.2 k lz4-libs x86_64 1.8.3-3.el8_4 rhel-baseos 66 k mpfr x86_64 3.1.6-1.el8 rhel-baseos 221 k ncurses x86_64 6.1-10.20180224.el8 rhel-baseos 387 k ncurses-base noarch 6.1-10.20180224.el8 rhel-baseos 81 k ncurses-libs x86_64 6.1-10.20180224.el8 rhel-baseos 334 k nettle x86_64 3.4.1-7.el8 rhel-baseos 301 k npth x86_64 1.5-4.el8 rhel-baseos 26 k ocaml-srpm-macros noarch 5-4.el8 rhel-appstream 9.5 k openblas-srpm-macros noarch 2-2.el8 rhel-appstream 8.0 k openldap x86_64 2.4.46-18.el8 rhel-baseos 352 k openssl-libs x86_64 1:1.1.1k-12.el8_9 rhel-baseos 1.5 M p11-kit x86_64 0.23.22-1.el8 rhel-baseos 324 k p11-kit-trust x86_64 0.23.22-1.el8 rhel-baseos 137 k pam x86_64 1.3.1-27.el8 
rhel-baseos 746 k pcre x86_64 8.42-6.el8 rhel-baseos 211 k pcre2 x86_64 10.32-3.el8_6 rhel-baseos 247 k perl-srpm-macros noarch 1-25.el8 rhel-appstream 11 k pkgconf x86_64 1.4.2-1.el8 rhel-baseos 38 k pkgconf-m4 noarch 1.4.2-1.el8 rhel-baseos 17 k pkgconf-pkg-config x86_64 1.4.2-1.el8 rhel-baseos 15 k platform-python x86_64 3.6.8-56.el8_9.3 rhel-baseos 87 k platform-python-setuptools noarch 39.2.0-7.el8 rhel-baseos 632 k popt x86_64 1.18-1.el8 rhel-baseos 61 k publicsuffix-list-dafsa noarch 20180723-1.el8 rhel-baseos 56 k python-rpm-macros noarch 3-45.el8 rhel-appstream 16 k python-srpm-macros noarch 3-45.el8 rhel-appstream 16 k python3-libs x86_64 3.6.8-56.el8_9.3 rhel-baseos 7.8 M python3-pip-wheel noarch 9.0.3-23.el8_9.1 rhel-baseos 866 k python3-rpm-macros noarch 3-45.el8 rhel-appstream 15 k python3-setuptools-wheel noarch 39.2.0-7.el8 rhel-baseos 289 k qt5-srpm-macros noarch 5.15.3-1.el8 rhel-appstream 11 k readline x86_64 7.0-10.el8 rhel-baseos 199 k rpm x86_64 4.14.3-28.el8_9 rhel-baseos 544 k rpm-build-libs x86_64 4.14.3-28.el8_9 rhel-baseos 157 k rpm-libs x86_64 4.14.3-28.el8_9 rhel-baseos 348 k rust-srpm-macros noarch 5-2.el8 rhel-appstream 9.3 k setup noarch 2.12.2-9.el8 rhel-baseos 181 k shadow-utils x86_64 2:4.6-19.el8 rhel-baseos 1.2 M sqlite-libs x86_64 3.26.0-19.el8_9 rhel-baseos 581 k systemd-libs x86_64 239-78.el8 rhel-baseos 1.1 M tpm2-tss x86_64 2.3.2-5.el8 rhel-baseos 275 k tzdata noarch 2024a-1.el8 rhel-baseos 475 k xz-libs x86_64 5.2.4-4.el8_6 rhel-baseos 94 k zip x86_64 3.0-23.el8 rhel-baseos 270 k zlib x86_64 1.2.11-25.el8 rhel-baseos 103 k zstd x86_64 1.4.4-1.el8 rhel-appstream 393 k Transaction Summary =========================================================================================== Install 172 Packages Total download size: 163 M Installed size: 813 M Downloading Packages: (1/172): cracklib-2.9.6-15.el8.x86_64.rpm 817 kB/s | 93 kB 00:00 (2/172): bzip2-libs-1.0.6-26.el8.x86_64.rpm 387 kB/s | 48 kB 00:00 (3/172): 
bzip2-1.0.6-26.el8.x86_64.rpm 462 kB/s | 60 kB 00:00 (4/172): grep-3.1-6.el8.x86_64.rpm 3.6 MB/s | 274 kB 00:00 (5/172): libassuan-2.5.1-3.el8.x86_64.rpm 775 kB/s | 83 kB 00:00 (6/172): cracklib-dicts-2.9.6-15.el8.x86_64.rpm 29 MB/s | 4.0 MB 00:00 (7/172): libattr-2.4.48-3.el8.x86_64.rpm 429 kB/s | 27 kB 00:00 (8/172): libunistring-0.9.9-3.el8.x86_64.rpm 7.7 MB/s | 422 kB 00:00 (9/172): libutempter-1.1.6-14.el8.x86_64.rpm 499 kB/s | 32 kB 00:00 (10/172): libsigsegv-2.11-5.el8.x86_64.rpm 245 kB/s | 30 kB 00:00 (11/172): mpfr-3.1.6-1.el8.x86_64.rpm 3.3 MB/s | 221 kB 00:00 (12/172): npth-1.5-4.el8.x86_64.rpm 396 kB/s | 26 kB 00:00 (13/172): pkgconf-1.4.2-1.el8.x86_64.rpm 675 kB/s | 38 kB 00:00 (14/172): pkgconf-pkg-config-1.4.2-1.el8.x86_64 246 kB/s | 15 kB 00:00 (15/172): readline-7.0-10.el8.x86_64.rpm 2.2 MB/s | 199 kB 00:00 (16/172): basesystem-11-5.el8.noarch.rpm 157 kB/s | 11 kB 00:00 (17/172): zip-3.0-23.el8.x86_64.rpm 2.6 MB/s | 270 kB 00:00 (18/172): libacl-2.2.53-1.el8.x86_64.rpm 413 kB/s | 35 kB 00:00 (19/172): libgpg-error-1.31-1.el8.x86_64.rpm 2.9 MB/s | 242 kB 00:00 (20/172): libnsl2-1.2.0-2.20180605git4a062cf.el 918 kB/s | 58 kB 00:00 (21/172): libpkgconf-1.4.2-1.el8.x86_64.rpm 640 kB/s | 35 kB 00:00 (22/172): pkgconf-m4-1.4.2-1.el8.noarch.rpm 330 kB/s | 17 kB 00:00 (23/172): libtool-ltdl-2.4.6-25.el8.x86_64.rpm 939 kB/s | 58 kB 00:00 (24/172): publicsuffix-list-dafsa-20180723-1.el 1.0 MB/s | 56 kB 00:00 (25/172): gmp-6.1.2-10.el8.x86_64.rpm 3.9 MB/s | 321 kB 00:00 (26/172): diffutils-3.6-6.el8.x86_64.rpm 4.4 MB/s | 359 kB 00:00 (27/172): libidn2-2.2.0-1.el8.x86_64.rpm 1.3 MB/s | 94 kB 00:00 (28/172): patch-2.7.6-11.el8.x86_64.rpm 2.6 MB/s | 138 kB 00:00 (29/172): libzstd-1.4.4-1.el8.x86_64.rpm 4.5 MB/s | 266 kB 00:00 (30/172): libusbx-1.0.23-4.el8.x86_64.rpm 716 kB/s | 74 kB 00:00 (31/172): p11-kit-trust-0.23.22-1.el8.x86_64.rp 2.6 MB/s | 137 kB 00:00 (32/172): libpsl-0.20.2-6.el8.x86_64.rpm 527 kB/s | 61 kB 00:00 (33/172): popt-1.18-1.el8.x86_64.rpm 
696 kB/s | 61 kB 00:00 (34/172): brotli-1.0.6-3.el8.x86_64.rpm 2.5 MB/s | 323 kB 00:00 (35/172): ima-evm-utils-1.3.2-12.el8.x86_64.rpm 623 kB/s | 64 kB 00:00 (36/172): lz4-libs-1.8.3-3.el8_4.x86_64.rpm 758 kB/s | 66 kB 00:00 (37/172): p11-kit-0.23.22-1.el8.x86_64.rpm 4.0 MB/s | 324 kB 00:00 (38/172): libcap-ng-0.7.11-1.el8.x86_64.rpm 635 kB/s | 33 kB 00:00 (39/172): filesystem-3.8-6.el8.x86_64.rpm 12 MB/s | 1.1 MB 00:00 (40/172): libdb-5.3.28-42.el8_4.x86_64.rpm 9.8 MB/s | 751 kB 00:00 (41/172): libdb-utils-5.3.28-42.el8_4.x86_64.rp 2.0 MB/s | 150 kB 00:00 (42/172): libxcrypt-4.1.1-6.el8.x86_64.rpm 1.1 MB/s | 73 kB 00:00 (43/172): nettle-3.4.1-7.el8.x86_64.rpm 5.0 MB/s | 301 kB 00:00 (44/172): libxcrypt-devel-4.1.1-6.el8.x86_64.rp 421 kB/s | 25 kB 00:00 (45/172): openldap-2.4.46-18.el8.x86_64.rpm 5.3 MB/s | 352 kB 00:00 (46/172): cyrus-sasl-lib-2.1.27-6.el8_5.x86_64. 1.5 MB/s | 123 kB 00:00 (47/172): pcre-8.42-6.el8.x86_64.rpm 2.4 MB/s | 211 kB 00:00 (48/172): gzip-1.9-13.el8_5.x86_64.rpm 2.3 MB/s | 167 kB 00:00 (49/172): keyutils-libs-1.5.10-9.el8.x86_64.rpm 634 kB/s | 34 kB 00:00 (50/172): libsepol-2.9-3.el8.x86_64.rpm 6.1 MB/s | 340 kB 00:00 (51/172): lua-libs-5.3.4-12.el8.x86_64.rpm 1.6 MB/s | 118 kB 00:00 (52/172): cpio-2.12-11.el8.x86_64.rpm 5.0 MB/s | 266 kB 00:00 (53/172): info-6.5-7.el8.x86_64.rpm 3.3 MB/s | 198 kB 00:00 (54/172): gawk-4.2.1-4.el8.x86_64.rpm 14 MB/s | 1.1 MB 00:00 (55/172): sed-4.5-5.el8.x86_64.rpm 5.7 MB/s | 298 kB 00:00 (56/172): make-4.2.1-11.el8.x86_64.rpm 5.7 MB/s | 498 kB 00:00 (57/172): unzip-6.0-46.el8.x86_64.rpm 3.7 MB/s | 196 kB 00:00 (58/172): xz-5.2.4-4.el8_6.x86_64.rpm 2.2 MB/s | 153 kB 00:00 (59/172): xz-libs-5.2.4-4.el8_6.x86_64.rpm 1.4 MB/s | 94 kB 00:00 (60/172): bash-4.4.20-4.el8_6.x86_64.rpm 19 MB/s | 1.5 MB 00:00 (61/172): gdbm-libs-1.18-2.el8.x86_64.rpm 1.2 MB/s | 60 kB 00:00 (62/172): gnupg2-2.2.20-3.el8_6.x86_64.rpm 34 MB/s | 2.4 MB 00:00 (63/172): libbabeltrace-1.5.4-4.el8.x86_64.rpm 2.7 MB/s | 200 kB 00:00 
(64/172): libcom_err-1.45.6-5.el8.x86_64.rpm 995 kB/s | 49 kB 00:00 (65/172): libgcrypt-1.8.5-7.el8_6.x86_64.rpm 7.6 MB/s | 463 kB 00:00 (66/172): libsemanage-2.9-9.el8_6.x86_64.rpm 3.3 MB/s | 168 kB 00:00 (67/172): libtirpc-1.1.4-8.el8.x86_64.rpm 1.8 MB/s | 113 kB 00:00 (68/172): libverto-0.3.2-2.el8.x86_64.rpm 487 kB/s | 24 kB 00:00 (69/172): pcre2-10.32-3.el8_6.x86_64.rpm 4.8 MB/s | 247 kB 00:00 (70/172): gdbm-1.18-2.el8.x86_64.rpm 2.1 MB/s | 130 kB 00:00 (71/172): libksba-1.3.5-9.el8_7.x86_64.rpm 2.3 MB/s | 134 kB 00:00 (72/172): libtasn1-4.13-4.el8_7.x86_64.rpm 1.3 MB/s | 76 kB 00:00 (73/172): coreutils-8.30-15.el8.x86_64.rpm 18 MB/s | 1.2 MB 00:00 (74/172): coreutils-common-8.30-15.el8.x86_64.r 30 MB/s | 2.0 MB 00:00 (75/172): glib2-2.56.4-161.el8.x86_64.rpm 31 MB/s | 2.5 MB 00:00 (76/172): libarchive-3.3.3-5.el8.x86_64.rpm 5.7 MB/s | 360 kB 00:00 (77/172): libffi-3.1-24.el8.x86_64.rpm 705 kB/s | 38 kB 00:00 (78/172): libselinux-2.9-8.el8.x86_64.rpm 3.1 MB/s | 166 kB 00:00 (79/172): platform-python-setuptools-39.2.0-7.e 12 MB/s | 632 kB 00:00 (80/172): libpwquality-1.4.4-6.el8.x86_64.rpm 1.1 MB/s | 107 kB 00:00 (81/172): python3-setuptools-wheel-39.2.0-7.el8 5.6 MB/s | 289 kB 00:00 (82/172): tar-1.30-9.el8.x86_64.rpm 16 MB/s | 839 kB 00:00 (83/172): audit-libs-3.0.7-5.el8.x86_64.rpm 2.3 MB/s | 123 kB 00:00 (84/172): setup-2.12.2-9.el8.noarch.rpm 1.7 MB/s | 181 kB 00:00 (85/172): chkconfig-1.19.2-1.el8.x86_64.rpm 3.6 MB/s | 199 kB 00:00 (86/172): ca-certificates-2023.2.60_v7.0.306-80 14 MB/s | 935 kB 00:00 (87/172): binutils-2.30-123.el8.x86_64.rpm 49 MB/s | 5.8 MB 00:00 (88/172): crypto-policies-20230731-1.git3177e06 1.2 MB/s | 64 kB 00:00 (89/172): elfutils-libelf-0.189-3.el8.x86_64.rp 4.5 MB/s | 232 kB 00:00 (90/172): elfutils-0.189-3.el8.x86_64.rpm 6.7 MB/s | 553 kB 00:00 (91/172): file-5.33-25.el8.x86_64.rpm 1.3 MB/s | 77 kB 00:00 (92/172): elfutils-libs-0.189-3.el8.x86_64.rpm 3.8 MB/s | 303 kB 00:00 (93/172): file-libs-5.33-25.el8.x86_64.rpm 10 MB/s | 
544 kB 00:00 (94/172): findutils-4.6.0-21.el8.x86_64.rpm 9.8 MB/s | 527 kB 00:00 (95/172): libgomp-8.5.0-20.el8.x86_64.rpm 3.9 MB/s | 208 kB 00:00 (96/172): libgcc-8.5.0-20.el8.x86_64.rpm 1.3 MB/s | 81 kB 00:00 (97/172): libnghttp2-1.33.0-5.el8_9.x86_64.rpm 1.4 MB/s | 78 kB 00:00 (98/172): libstdc++-8.5.0-20.el8.x86_64.rpm 8.5 MB/s | 455 kB 00:00 (99/172): ncurses-libs-6.1-10.20180224.el8.x86_ 4.7 MB/s | 334 kB 00:00 (100/172): pam-1.3.1-27.el8.x86_64.rpm 13 MB/s | 746 kB 00:00 (101/172): which-2.21-20.el8.x86_64.rpm 915 kB/s | 50 kB 00:00 (102/172): elfutils-default-yama-scope-0.189-3. 887 kB/s | 52 kB 00:00 (103/172): libxml2-2.9.7-18.el8_9.x86_64.rpm 12 MB/s | 697 kB 00:00 (104/172): libcap-2.48-6.el8_9.x86_64.rpm 929 kB/s | 74 kB 00:00 (105/172): krb5-libs-1.18.2-26.el8_9.x86_64.rpm 9.2 MB/s | 842 kB 00:00 (106/172): ncurses-base-6.1-10.20180224.el8.noa 1.5 MB/s | 81 kB 00:00 (107/172): ncurses-6.1-10.20180224.el8.x86_64.r 6.6 MB/s | 387 kB 00:00 (108/172): openssl-libs-1.1.1k-12.el8_9.x86_64. 25 MB/s | 1.5 MB 00:00 (109/172): redhat-release-8.9-0.1.el8.x86_64.rp 931 kB/s | 45 kB 00:00 (110/172): shadow-utils-4.6-19.el8.x86_64.rpm 22 MB/s | 1.2 MB 00:00 (111/172): python3-libs-3.6.8-56.el8_9.3.x86_64 54 MB/s | 7.8 MB 00:00 (112/172): sqlite-libs-3.26.0-19.el8_9.x86_64.r 7.6 MB/s | 581 kB 00:00 (113/172): platform-python-3.6.8-56.el8_9.3.x86 425 kB/s | 87 kB 00:00 (114/172): tpm2-tss-2.3.2-5.el8.x86_64.rpm 5.0 MB/s | 275 kB 00:00 (115/172): systemd-libs-239-78.el8.x86_64.rpm 12 MB/s | 1.1 MB 00:00 (116/172): libssh-config-0.9.6-13.el8_9.noarch. 
351 kB/s | 21 kB 00:00 (117/172): zlib-1.2.11-25.el8.x86_64.rpm 973 kB/s | 103 kB 00:00 (118/172): libssh-0.9.6-13.el8_9.x86_64.rpm 2.2 MB/s | 220 kB 00:00 (119/172): rpm-4.14.3-28.el8_9.x86_64.rpm 6.5 MB/s | 544 kB 00:00 (120/172): rpm-build-libs-4.14.3-28.el8_9.x86_6 1.9 MB/s | 157 kB 00:00 (121/172): rpm-libs-4.14.3-28.el8_9.x86_64.rpm 5.7 MB/s | 348 kB 00:00 (122/172): glibc-2.28-236.el8_9.12.x86_64.rpm 37 MB/s | 2.2 MB 00:00 (123/172): tzdata-2024a-1.el8.noarch.rpm 3.8 MB/s | 475 kB 00:00 (124/172): glibc-common-2.28-236.el8_9.12.x86_6 17 MB/s | 1.0 MB 00:00 (125/172): glibc-all-langpacks-2.28-236.el8_9.1 145 MB/s | 26 MB 00:00 (126/172): glibc-devel-2.28-236.el8_9.12.x86_64 1.3 MB/s | 86 kB 00:00 (127/172): glibc-gconv-extra-2.28-236.el8_9.12. 24 MB/s | 1.6 MB 00:00 (128/172): glibc-headers-2.28-236.el8_9.12.x86_ 9.2 MB/s | 491 kB 00:00 (129/172): curl-7.61.1-33.el8_9.5.x86_64.rpm 6.5 MB/s | 354 kB 00:00 (130/172): kernel-headers-4.18.0-513.24.1.el8_9 98 MB/s | 11 MB 00:00 (131/172): libblkid-2.32.1-44.el8_9.1.x86_64.rp 3.0 MB/s | 221 kB 00:00 (132/172): libcurl-7.61.1-33.el8_9.5.x86_64.rpm 4.1 MB/s | 304 kB 00:00 (133/172): libmount-2.32.1-44.el8_9.1.x86_64.rp 4.6 MB/s | 237 kB 00:00 (134/172): libfdisk-2.32.1-44.el8_9.1.x86_64.rp 4.7 MB/s | 254 kB 00:00 (135/172): libsmartcols-2.32.1-44.el8_9.1.x86_6 3.3 MB/s | 180 kB 00:00 (136/172): libuuid-2.32.1-44.el8_9.1.x86_64.rpm 1.9 MB/s | 99 kB 00:00 (137/172): python3-pip-wheel-9.0.3-23.el8_9.1.n 15 MB/s | 866 kB 00:00 (138/172): util-linux-2.32.1-44.el8_9.1.x86_64. 40 MB/s | 2.5 MB 00:00 (139/172): expat-2.2.5-11.el8_9.1.x86_64.rpm 2.2 MB/s | 114 kB 00:00 (140/172): ghc-srpm-macros-1.4.2-7.el8.noarch.r 190 kB/s | 9.4 kB 00:00 (141/172): gnutls-3.6.16-8.el8_9.3.x86_64.rpm 15 MB/s | 1.0 MB 00:00 (142/172): ocaml-srpm-macros-5-4.el8.noarch.rpm 190 kB/s | 9.5 kB 00:00 (143/172): openblas-srpm-macros-2-2.el8.noarch. 
156 kB/s | 8.0 kB 00:00 (144/172): perl-srpm-macros-1-25.el8.noarch.rpm 168 kB/s | 11 kB 00:00 (145/172): rust-srpm-macros-5-2.el8.noarch.rpm 185 kB/s | 9.3 kB 00:00 (146/172): libatomic_ops-7.6.2-3.el8.x86_64.rpm 535 kB/s | 38 kB 00:00 (147/172): gc-7.6.4-3.el8.x86_64.rpm 1.7 MB/s | 109 kB 00:00 (148/172): libipt-1.6.1-8.el8.x86_64.rpm 927 kB/s | 50 kB 00:00 (149/172): isl-0.16.1-6.el8.x86_64.rpm 12 MB/s | 841 kB 00:00 (150/172): guile-2.0.14-7.el8.x86_64.rpm 31 MB/s | 3.5 MB 00:00 (151/172): libmpc-1.1.0-9.1.el8.x86_64.rpm 1.2 MB/s | 61 kB 00:00 (152/172): efi-srpm-macros-3-3.el8.noarch.rpm 455 kB/s | 22 kB 00:00 (153/172): zstd-1.4.4-1.el8.x86_64.rpm 5.6 MB/s | 393 kB 00:00 (154/172): go-srpm-macros-2-17.el8.noarch.rpm 255 kB/s | 13 kB 00:00 (155/172): qt5-srpm-macros-5.15.3-1.el8.noarch. 211 kB/s | 11 kB 00:00 (156/172): python-rpm-macros-3-45.el8.noarch.rp 308 kB/s | 16 kB 00:00 (157/172): python3-rpm-macros-3-45.el8.noarch.r 230 kB/s | 15 kB 00:00 (158/172): redhat-rpm-config-131-1.el8.noarch.r 1.8 MB/s | 91 kB 00:00 (159/172): python-srpm-macros-3-45.el8.noarch.r 301 kB/s | 16 kB 00:00 (160/172): dwz-0.12-10.el8.x86_64.rpm 416 kB/s | 109 kB 00:00 (161/172): gcc-c++-8.5.0-20.el8.x86_64.rpm 94 MB/s | 12 MB 00:00 (162/172): gcc-plugin-annobin-8.5.0-20.el8.x86_ 365 kB/s | 36 kB 00:00 (163/172): annobin-11.13-2.el8.x86_64.rpm 18 MB/s | 972 kB 00:00 (164/172): cpp-8.5.0-20.el8.x86_64.rpm 102 MB/s | 10 MB 00:00 (165/172): gdb-headless-8.2-20.el8.x86_64.rpm 48 MB/s | 3.7 MB 00:00 (166/172): rpm-build-4.14.3-28.el8_9.x86_64.rpm 2.9 MB/s | 174 kB 00:00 (167/172): libstdc++-devel-8.5.0-20.el8.x86_64. 25 MB/s | 2.1 MB 00:00 (168/172): ansible-srpm-macros-1-12.el8.noarch. 
1.7 MB/s | 21 kB 00:00 (169/172): fpc-srpm-macros-1.3-1.el8.noarch.rpm 1.8 MB/s | 8.2 kB 00:00 (170/172): lua-srpm-macros-1-13.el8.noarch.rpm 4.3 MB/s | 9.2 kB 00:00 (171/172): epel-rpm-macros-8-41.noarch.rpm 3.0 MB/s | 27 kB 00:00 (172/172): gcc-8.5.0-20.el8.x86_64.rpm 104 MB/s | 23 MB 00:00 -------------------------------------------------------------------------------- Total 38 MB/s | 163 MB 00:04 Red Hat Enterprise Linux - BaseOS 3.1 MB/s | 3.1 kB 00:00 Importing GPG key 0xFD431D51: Userid : "Red Hat, Inc. (release key 2) " Fingerprint: 567E 347A D004 4ADE 55BA 8A5F 199E 2F91 FD43 1D51 From : /usr/share/distribution-gpg-keys/redhat/RPM-GPG-KEY-redhat8-release Key imported successfully Importing GPG key 0x2FA658E0: Userid : "Red Hat, Inc. (auxiliary key) " Fingerprint: 43A6 E49C 4A38 F4BE 9ABF 2A53 4568 9C88 2FA6 58E0 From : /usr/share/distribution-gpg-keys/redhat/RPM-GPG-KEY-redhat8-release Key imported successfully Extra Packages for Enterprise Linux 8 - x86_64 1.6 MB/s | 1.6 kB 00:00 Importing GPG key 0x2F86D6A1: Userid : "Fedora EPEL (8) " Fingerprint: 94E2 79EB 8D8F 25B2 1810 ADF1 21EA 45AB 2F86 D6A1 From : /usr/share/distribution-gpg-keys/epel/RPM-GPG-KEY-EPEL-8 Key imported successfully Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. 
Running transaction Running scriptlet: filesystem-3.8-6.el8.x86_64 1/1 Preparing : 1/1 Installing : libgcc-8.5.0-20.el8.x86_64 1/172 Running scriptlet: libgcc-8.5.0-20.el8.x86_64 1/172 Installing : python-srpm-macros-3-45.el8.noarch 2/172 Installing : crypto-policies-20230731-1.git3177e06.el8.noarch 3/172 Running scriptlet: crypto-policies-20230731-1.git3177e06.el8.noarch 3/172 Installing : python-rpm-macros-3-45.el8.noarch 4/172 Installing : python3-pip-wheel-9.0.3-23.el8_9.1.noarch 5/172 Installing : redhat-release-8.9-0.1.el8.x86_64 6/172 Installing : setup-2.12.2-9.el8.noarch 7/172 warning: /etc/hosts created as /etc/hosts.rpmnew Running scriptlet: setup-2.12.2-9.el8.noarch 7/172 Installing : filesystem-3.8-6.el8.x86_64 8/172 Installing : python3-setuptools-wheel-39.2.0-7.el8.noarch 9/172 Installing : basesystem-11-5.el8.noarch 10/172 Installing : python3-rpm-macros-3-45.el8.noarch 11/172 Installing : fpc-srpm-macros-1.3-1.el8.noarch 12/172 Installing : ansible-srpm-macros-1-12.el8.noarch 13/172 Installing : qt5-srpm-macros-5.15.3-1.el8.noarch 14/172 Installing : go-srpm-macros-2-17.el8.noarch 15/172 Installing : rust-srpm-macros-5-2.el8.noarch 16/172 Installing : perl-srpm-macros-1-25.el8.noarch 17/172 Installing : openblas-srpm-macros-2-2.el8.noarch 18/172 Installing : ocaml-srpm-macros-5-4.el8.noarch 19/172 Installing : ghc-srpm-macros-1.4.2-7.el8.noarch 20/172 Installing : kernel-headers-4.18.0-513.24.1.el8_9.x86_64 21/172 Installing : tzdata-2024a-1.el8.noarch 22/172 Installing : libssh-config-0.9.6-13.el8_9.noarch 23/172 Installing : ncurses-base-6.1-10.20180224.el8.noarch 24/172 Installing : pcre2-10.32-3.el8_6.x86_64 25/172 Installing : libselinux-2.9-8.el8.x86_64 26/172 Installing : ncurses-libs-6.1-10.20180224.el8.x86_64 27/172 Installing : glibc-all-langpacks-2.28-236.el8_9.12.x86_64 28/172 Installing : glibc-common-2.28-236.el8_9.12.x86_64 29/172 Installing : glibc-gconv-extra-2.28-236.el8_9.12.x86_64 30/172 Running scriptlet: 
glibc-gconv-extra-2.28-236.el8_9.12.x86_64 30/172 Running scriptlet: glibc-2.28-236.el8_9.12.x86_64 31/172 Installing : glibc-2.28-236.el8_9.12.x86_64 31/172 Running scriptlet: glibc-2.28-236.el8_9.12.x86_64 31/172 Installing : bash-4.4.20-4.el8_6.x86_64 32/172 Running scriptlet: bash-4.4.20-4.el8_6.x86_64 32/172 Installing : libsepol-2.9-3.el8.x86_64 33/172 Running scriptlet: libsepol-2.9-3.el8.x86_64 33/172 Installing : zlib-1.2.11-25.el8.x86_64 34/172 Installing : info-6.5-7.el8.x86_64 35/172 Installing : bzip2-libs-1.0.6-26.el8.x86_64 36/172 Installing : xz-libs-5.2.4-4.el8_6.x86_64 37/172 Installing : gmp-1:6.1.2-10.el8.x86_64 38/172 Running scriptlet: gmp-1:6.1.2-10.el8.x86_64 38/172 Installing : libstdc++-8.5.0-20.el8.x86_64 39/172 Running scriptlet: libstdc++-8.5.0-20.el8.x86_64 39/172 Installing : libzstd-1.4.4-1.el8.x86_64 40/172 Installing : elfutils-libelf-0.189-3.el8.x86_64 41/172 Installing : libxcrypt-4.1.1-6.el8.x86_64 42/172 Installing : mpfr-3.1.6-1.el8.x86_64 43/172 Running scriptlet: mpfr-3.1.6-1.el8.x86_64 43/172 Installing : readline-7.0-10.el8.x86_64 44/172 Running scriptlet: readline-7.0-10.el8.x86_64 44/172 Installing : sqlite-libs-3.26.0-19.el8_9.x86_64 45/172 Installing : popt-1.18-1.el8.x86_64 46/172 Installing : libcap-2.48-6.el8_9.x86_64 47/172 Installing : libcom_err-1.45.6-5.el8.x86_64 48/172 Running scriptlet: libcom_err-1.45.6-5.el8.x86_64 48/172 Installing : libuuid-2.32.1-44.el8_9.1.x86_64 49/172 Running scriptlet: libuuid-2.32.1-44.el8_9.1.x86_64 49/172 Installing : chkconfig-1.19.2-1.el8.x86_64 50/172 Installing : libunistring-0.9.9-3.el8.x86_64 51/172 Installing : libattr-2.4.48-3.el8.x86_64 52/172 Installing : libacl-2.2.53-1.el8.x86_64 53/172 Installing : sed-4.5-5.el8.x86_64 54/172 Running scriptlet: sed-4.5-5.el8.x86_64 54/172 Installing : libgpg-error-1.31-1.el8.x86_64 55/172 Installing : lua-libs-5.3.4-12.el8.x86_64 56/172 Installing : libffi-3.1-24.el8.x86_64 57/172 Installing : p11-kit-0.23.22-1.el8.x86_64 58/172 
Installing : libidn2-2.2.0-1.el8.x86_64 59/172 Installing : libmpc-1.1.0-9.1.el8.x86_64 60/172 Installing : file-libs-5.33-25.el8.x86_64 61/172 Installing : file-5.33-25.el8.x86_64 62/172 Installing : libgcrypt-1.8.5-7.el8_6.x86_64 63/172 Running scriptlet: libgcrypt-1.8.5-7.el8_6.x86_64 63/172 Installing : unzip-6.0-46.el8.x86_64 64/172 Installing : findutils-1:4.6.0-21.el8.x86_64 65/172 Running scriptlet: findutils-1:4.6.0-21.el8.x86_64 65/172 Installing : elfutils-default-yama-scope-0.189-3.el8.noarch 66/172 Running scriptlet: elfutils-default-yama-scope-0.189-3.el8.noarch 66/172 Installing : elfutils-libs-0.189-3.el8.x86_64 67/172 Running scriptlet: glibc-headers-2.28-236.el8_9.12.x86_64 68/172 Installing : glibc-headers-2.28-236.el8_9.12.x86_64 68/172 Installing : lz4-libs-1.8.3-3.el8_4.x86_64 69/172 Installing : libcap-ng-0.7.11-1.el8.x86_64 70/172 Installing : audit-libs-3.0.7-5.el8.x86_64 71/172 Installing : pcre-8.42-6.el8.x86_64 72/172 Installing : grep-3.1-6.el8.x86_64 73/172 Running scriptlet: grep-3.1-6.el8.x86_64 73/172 Installing : keyutils-libs-1.5.10-9.el8.x86_64 74/172 Installing : gdbm-libs-1:1.18-2.el8.x86_64 75/172 Installing : libtasn1-4.13-4.el8_7.x86_64 76/172 Running scriptlet: libtasn1-4.13-4.el8_7.x86_64 76/172 Installing : p11-kit-trust-0.23.22-1.el8.x86_64 77/172 Running scriptlet: p11-kit-trust-0.23.22-1.el8.x86_64 77/172 Installing : expat-2.2.5-11.el8_9.1.x86_64 78/172 Installing : gdbm-1:1.18-2.el8.x86_64 79/172 Installing : xz-5.2.4-4.el8_6.x86_64 80/172 Installing : libsemanage-2.9-9.el8_6.x86_64 81/172 Installing : elfutils-0.189-3.el8.x86_64 82/172 Installing : zip-3.0-23.el8.x86_64 83/172 Installing : cpp-8.5.0-20.el8.x86_64 84/172 Running scriptlet: cpp-8.5.0-20.el8.x86_64 84/172 Installing : libassuan-2.5.1-3.el8.x86_64 85/172 Installing : libksba-1.3.5-9.el8_7.x86_64 86/172 Installing : tar-2:1.30-9.el8.x86_64 87/172 Running scriptlet: tar-2:1.30-9.el8.x86_64 87/172 Installing : patch-2.7.6-11.el8.x86_64 88/172 Installing : 
dwz-0.12-10.el8.x86_64 89/172 Installing : zstd-1.4.4-1.el8.x86_64 90/172 Installing : libstdc++-devel-8.5.0-20.el8.x86_64 91/172 Installing : nettle-3.4.1-7.el8.x86_64 92/172 Running scriptlet: nettle-3.4.1-7.el8.x86_64 92/172 Installing : gnutls-3.6.16-8.el8_9.3.x86_64 93/172 Installing : isl-0.16.1-6.el8.x86_64 94/172 Running scriptlet: isl-0.16.1-6.el8.x86_64 94/172 Installing : libxml2-2.9.7-18.el8_9.x86_64 95/172 Installing : bzip2-1.0.6-26.el8.x86_64 96/172 Installing : diffutils-3.6-6.el8.x86_64 97/172 Running scriptlet: diffutils-3.6-6.el8.x86_64 97/172 Installing : coreutils-common-8.30-15.el8.x86_64 98/172 Running scriptlet: coreutils-common-8.30-15.el8.x86_64 98/172 Installing : libgomp-8.5.0-20.el8.x86_64 99/172 Running scriptlet: libgomp-8.5.0-20.el8.x86_64 99/172 Installing : libsigsegv-2.11-5.el8.x86_64 100/172 Installing : gawk-4.2.1-4.el8.x86_64 101/172 Installing : npth-1.5-4.el8.x86_64 102/172 Installing : libpkgconf-1.4.2-1.el8.x86_64 103/172 Installing : pkgconf-1.4.2-1.el8.x86_64 104/172 Installing : libtool-ltdl-2.4.6-25.el8.x86_64 105/172 Running scriptlet: libtool-ltdl-2.4.6-25.el8.x86_64 105/172 Installing : brotli-1.0.6-3.el8.x86_64 106/172 Installing : cpio-2.12-11.el8.x86_64 107/172 Installing : libverto-0.3.2-2.el8.x86_64 108/172 Installing : libnghttp2-1.33.0-5.el8_9.x86_64 109/172 Installing : ncurses-6.1-10.20180224.el8.x86_64 110/172 Installing : openssl-libs-1:1.1.1k-12.el8_9.x86_64 111/172 Running scriptlet: openssl-libs-1:1.1.1k-12.el8_9.x86_64 111/172 Installing : coreutils-8.30-15.el8.x86_64 112/172 Running scriptlet: ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 113/172 Installing : ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 113/172 Running scriptlet: ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 113/172 Installing : libdb-5.3.28-42.el8_4.x86_64 114/172 Running scriptlet: libdb-5.3.28-42.el8_4.x86_64 114/172 Installing : krb5-libs-1.18.2-26.el8_9.x86_64 115/172 Installing : libtirpc-1.1.4-8.el8.x86_64 116/172 
Running scriptlet: libtirpc-1.1.4-8.el8.x86_64 116/172 Installing : libblkid-2.32.1-44.el8_9.1.x86_64 117/172 Running scriptlet: libblkid-2.32.1-44.el8_9.1.x86_64 117/172 Installing : libmount-2.32.1-44.el8_9.1.x86_64 118/172 Running scriptlet: libmount-2.32.1-44.el8_9.1.x86_64 118/172 Installing : systemd-libs-239-78.el8.x86_64 119/172 Running scriptlet: systemd-libs-239-78.el8.x86_64 119/172 Installing : libnsl2-1.2.0-2.20180605git4a062cf.el8.x86_64 120/172 Running scriptlet: libnsl2-1.2.0-2.20180605git4a062cf.el8.x86_64 120/172 Installing : platform-python-setuptools-39.2.0-7.el8.noarch 121/172 Installing : platform-python-3.6.8-56.el8_9.3.x86_64 122/172 Running scriptlet: platform-python-3.6.8-56.el8_9.3.x86_64 122/172 Installing : python3-libs-3.6.8-56.el8_9.3.x86_64 123/172 Installing : gzip-1.9-13.el8_5.x86_64 124/172 Running scriptlet: gzip-1.9-13.el8_5.x86_64 124/172 Installing : cracklib-2.9.6-15.el8.x86_64 125/172 Installing : cracklib-dicts-2.9.6-15.el8.x86_64 126/172 Installing : binutils-2.30-123.el8.x86_64 127/172 Running scriptlet: binutils-2.30-123.el8.x86_64 127/172 Installing : shadow-utils-2:4.6-19.el8.x86_64 128/172 Running scriptlet: libutempter-1.1.6-14.el8.x86_64 129/172 Installing : libutempter-1.1.6-14.el8.x86_64 129/172 Running scriptlet: tpm2-tss-2.3.2-5.el8.x86_64 130/172 Installing : tpm2-tss-2.3.2-5.el8.x86_64 130/172 Running scriptlet: tpm2-tss-2.3.2-5.el8.x86_64 130/172 Installing : ima-evm-utils-1.3.2-12.el8.x86_64 131/172 Installing : libpwquality-1.4.4-6.el8.x86_64 132/172 Installing : pam-1.3.1-27.el8.x86_64 133/172 Running scriptlet: pam-1.3.1-27.el8.x86_64 133/172 Installing : libusbx-1.0.23-4.el8.x86_64 134/172 Installing : glib2-2.56.4-161.el8.x86_64 135/172 Installing : libbabeltrace-1.5.4-4.el8.x86_64 136/172 Running scriptlet: libbabeltrace-1.5.4-4.el8.x86_64 136/172 Installing : libfdisk-2.32.1-44.el8_9.1.x86_64 137/172 Running scriptlet: libfdisk-2.32.1-44.el8_9.1.x86_64 137/172 Installing : 
cyrus-sasl-lib-2.1.27-6.el8_5.x86_64 138/172 Running scriptlet: cyrus-sasl-lib-2.1.27-6.el8_5.x86_64 138/172 Installing : openldap-2.4.46-18.el8.x86_64 139/172 Installing : gnupg2-2.2.20-3.el8_6.x86_64 140/172 Installing : libssh-0.9.6-13.el8_9.x86_64 141/172 Installing : libdb-utils-5.3.28-42.el8_4.x86_64 142/172 Installing : libarchive-3.3.3-5.el8.x86_64 143/172 Installing : libsmartcols-2.32.1-44.el8_9.1.x86_64 144/172 Running scriptlet: libsmartcols-2.32.1-44.el8_9.1.x86_64 144/172 Installing : libatomic_ops-7.6.2-3.el8.x86_64 145/172 Installing : gc-7.6.4-3.el8.x86_64 146/172 Installing : guile-5:2.0.14-7.el8.x86_64 147/172 Running scriptlet: guile-5:2.0.14-7.el8.x86_64 147/172 Installing : libipt-1.6.1-8.el8.x86_64 148/172 Installing : publicsuffix-list-dafsa-20180723-1.el8.noarch 149/172 Installing : libpsl-0.20.2-6.el8.x86_64 150/172 Installing : libcurl-7.61.1-33.el8_9.5.x86_64 151/172 Installing : curl-7.61.1-33.el8_9.5.x86_64 152/172 Installing : rpm-4.14.3-28.el8_9.x86_64 153/172 Installing : rpm-libs-4.14.3-28.el8_9.x86_64 154/172 Running scriptlet: rpm-libs-4.14.3-28.el8_9.x86_64 154/172 Installing : rpm-build-libs-4.14.3-28.el8_9.x86_64 155/172 Running scriptlet: rpm-build-libs-4.14.3-28.el8_9.x86_64 155/172 Installing : gdb-headless-8.2-20.el8.x86_64 156/172 Installing : efi-srpm-macros-3-3.el8.noarch 157/172 Installing : lua-srpm-macros-1-13.el8.noarch 158/172 Installing : pkgconf-m4-1.4.2-1.el8.noarch 159/172 Installing : pkgconf-pkg-config-1.4.2-1.el8.x86_64 160/172 Installing : glibc-devel-2.28-236.el8_9.12.x86_64 161/172 Running scriptlet: glibc-devel-2.28-236.el8_9.12.x86_64 161/172 Installing : libxcrypt-devel-4.1.1-6.el8.x86_64 162/172 Installing : gcc-8.5.0-20.el8.x86_64 163/172 Running scriptlet: gcc-8.5.0-20.el8.x86_64 163/172 Installing : gcc-plugin-annobin-8.5.0-20.el8.x86_64 164/172 Installing : annobin-11.13-2.el8.x86_64 165/172 Installing : redhat-rpm-config-131-1.el8.noarch 166/172 Running scriptlet: 
redhat-rpm-config-131-1.el8.noarch 166/172 Installing : rpm-build-4.14.3-28.el8_9.x86_64 167/172 Installing : gcc-c++-8.5.0-20.el8.x86_64 168/172 Installing : epel-rpm-macros-8-41.noarch 169/172 Installing : util-linux-2.32.1-44.el8_9.1.x86_64 170/172 Running scriptlet: util-linux-2.32.1-44.el8_9.1.x86_64 170/172 Installing : which-2.21-20.el8.x86_64 171/172 Installing : make-1:4.2.1-11.el8.x86_64 172/172 Running scriptlet: make-1:4.2.1-11.el8.x86_64 172/172 Running scriptlet: filesystem-3.8-6.el8.x86_64 172/172 Running scriptlet: glibc-all-langpacks-2.28-236.el8_9.12.x86_64 172/172 Running scriptlet: ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 172/172 Running scriptlet: guile-5:2.0.14-7.el8.x86_64 172/172 Running scriptlet: glibc-common-2.28-236.el8_9.12.x86_64 172/172 Running scriptlet: info-6.5-7.el8.x86_64 172/172 Running scriptlet: glib2-2.56.4-161.el8.x86_64 172/172 Verifying : bzip2-1.0.6-26.el8.x86_64 1/172 Verifying : bzip2-libs-1.0.6-26.el8.x86_64 2/172 Verifying : cracklib-2.9.6-15.el8.x86_64 3/172 Verifying : cracklib-dicts-2.9.6-15.el8.x86_64 4/172 Verifying : grep-3.1-6.el8.x86_64 5/172 Verifying : libassuan-2.5.1-3.el8.x86_64 6/172 Verifying : libattr-2.4.48-3.el8.x86_64 7/172 Verifying : libsigsegv-2.11-5.el8.x86_64 8/172 Verifying : libunistring-0.9.9-3.el8.x86_64 9/172 Verifying : libutempter-1.1.6-14.el8.x86_64 10/172 Verifying : mpfr-3.1.6-1.el8.x86_64 11/172 Verifying : npth-1.5-4.el8.x86_64 12/172 Verifying : pkgconf-1.4.2-1.el8.x86_64 13/172 Verifying : pkgconf-pkg-config-1.4.2-1.el8.x86_64 14/172 Verifying : readline-7.0-10.el8.x86_64 15/172 Verifying : zip-3.0-23.el8.x86_64 16/172 Verifying : basesystem-11-5.el8.noarch 17/172 Verifying : libacl-2.2.53-1.el8.x86_64 18/172 Verifying : libgpg-error-1.31-1.el8.x86_64 19/172 Verifying : libnsl2-1.2.0-2.20180605git4a062cf.el8.x86_64 20/172 Verifying : libpkgconf-1.4.2-1.el8.x86_64 21/172 Verifying : libtool-ltdl-2.4.6-25.el8.x86_64 22/172 Verifying : pkgconf-m4-1.4.2-1.el8.noarch 23/172 
Verifying : publicsuffix-list-dafsa-20180723-1.el8.noarch 24/172 Verifying : gmp-1:6.1.2-10.el8.x86_64 25/172 Verifying : diffutils-3.6-6.el8.x86_64 26/172 Verifying : libidn2-2.2.0-1.el8.x86_64 27/172 Verifying : patch-2.7.6-11.el8.x86_64 28/172 Verifying : libusbx-1.0.23-4.el8.x86_64 29/172 Verifying : libzstd-1.4.4-1.el8.x86_64 30/172 Verifying : libpsl-0.20.2-6.el8.x86_64 31/172 Verifying : p11-kit-trust-0.23.22-1.el8.x86_64 32/172 Verifying : popt-1.18-1.el8.x86_64 33/172 Verifying : brotli-1.0.6-3.el8.x86_64 34/172 Verifying : ima-evm-utils-1.3.2-12.el8.x86_64 35/172 Verifying : lz4-libs-1.8.3-3.el8_4.x86_64 36/172 Verifying : p11-kit-0.23.22-1.el8.x86_64 37/172 Verifying : filesystem-3.8-6.el8.x86_64 38/172 Verifying : libcap-ng-0.7.11-1.el8.x86_64 39/172 Verifying : libdb-5.3.28-42.el8_4.x86_64 40/172 Verifying : libdb-utils-5.3.28-42.el8_4.x86_64 41/172 Verifying : libxcrypt-4.1.1-6.el8.x86_64 42/172 Verifying : libxcrypt-devel-4.1.1-6.el8.x86_64 43/172 Verifying : nettle-3.4.1-7.el8.x86_64 44/172 Verifying : openldap-2.4.46-18.el8.x86_64 45/172 Verifying : pcre-8.42-6.el8.x86_64 46/172 Verifying : cyrus-sasl-lib-2.1.27-6.el8_5.x86_64 47/172 Verifying : gzip-1.9-13.el8_5.x86_64 48/172 Verifying : keyutils-libs-1.5.10-9.el8.x86_64 49/172 Verifying : libsepol-2.9-3.el8.x86_64 50/172 Verifying : lua-libs-5.3.4-12.el8.x86_64 51/172 Verifying : cpio-2.12-11.el8.x86_64 52/172 Verifying : gawk-4.2.1-4.el8.x86_64 53/172 Verifying : info-6.5-7.el8.x86_64 54/172 Verifying : make-1:4.2.1-11.el8.x86_64 55/172 Verifying : sed-4.5-5.el8.x86_64 56/172 Verifying : unzip-6.0-46.el8.x86_64 57/172 Verifying : xz-5.2.4-4.el8_6.x86_64 58/172 Verifying : xz-libs-5.2.4-4.el8_6.x86_64 59/172 Verifying : bash-4.4.20-4.el8_6.x86_64 60/172 Verifying : gdbm-libs-1:1.18-2.el8.x86_64 61/172 Verifying : gnupg2-2.2.20-3.el8_6.x86_64 62/172 Verifying : libbabeltrace-1.5.4-4.el8.x86_64 63/172 Verifying : libcom_err-1.45.6-5.el8.x86_64 64/172 Verifying : libgcrypt-1.8.5-7.el8_6.x86_64 
65/172 Verifying : libsemanage-2.9-9.el8_6.x86_64 66/172 Verifying : libtirpc-1.1.4-8.el8.x86_64 67/172 Verifying : libverto-0.3.2-2.el8.x86_64 68/172 Verifying : pcre2-10.32-3.el8_6.x86_64 69/172 Verifying : gdbm-1:1.18-2.el8.x86_64 70/172 Verifying : libksba-1.3.5-9.el8_7.x86_64 71/172 Verifying : libtasn1-4.13-4.el8_7.x86_64 72/172 Verifying : coreutils-8.30-15.el8.x86_64 73/172 Verifying : coreutils-common-8.30-15.el8.x86_64 74/172 Verifying : glib2-2.56.4-161.el8.x86_64 75/172 Verifying : libarchive-3.3.3-5.el8.x86_64 76/172 Verifying : libffi-3.1-24.el8.x86_64 77/172 Verifying : libpwquality-1.4.4-6.el8.x86_64 78/172 Verifying : libselinux-2.9-8.el8.x86_64 79/172 Verifying : platform-python-setuptools-39.2.0-7.el8.noarch 80/172 Verifying : python3-setuptools-wheel-39.2.0-7.el8.noarch 81/172 Verifying : setup-2.12.2-9.el8.noarch 82/172 Verifying : tar-2:1.30-9.el8.x86_64 83/172 Verifying : audit-libs-3.0.7-5.el8.x86_64 84/172 Verifying : binutils-2.30-123.el8.x86_64 85/172 Verifying : ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 86/172 Verifying : chkconfig-1.19.2-1.el8.x86_64 87/172 Verifying : crypto-policies-20230731-1.git3177e06.el8.noarch 88/172 Verifying : elfutils-0.189-3.el8.x86_64 89/172 Verifying : elfutils-libelf-0.189-3.el8.x86_64 90/172 Verifying : elfutils-libs-0.189-3.el8.x86_64 91/172 Verifying : file-5.33-25.el8.x86_64 92/172 Verifying : file-libs-5.33-25.el8.x86_64 93/172 Verifying : findutils-1:4.6.0-21.el8.x86_64 94/172 Verifying : libgcc-8.5.0-20.el8.x86_64 95/172 Verifying : libgomp-8.5.0-20.el8.x86_64 96/172 Verifying : libnghttp2-1.33.0-5.el8_9.x86_64 97/172 Verifying : libstdc++-8.5.0-20.el8.x86_64 98/172 Verifying : ncurses-libs-6.1-10.20180224.el8.x86_64 99/172 Verifying : pam-1.3.1-27.el8.x86_64 100/172 Verifying : which-2.21-20.el8.x86_64 101/172 Verifying : elfutils-default-yama-scope-0.189-3.el8.noarch 102/172 Verifying : krb5-libs-1.18.2-26.el8_9.x86_64 103/172 Verifying : libcap-2.48-6.el8_9.x86_64 104/172 Verifying : 
libxml2-2.9.7-18.el8_9.x86_64 105/172 Verifying : ncurses-6.1-10.20180224.el8.x86_64 106/172 Verifying : ncurses-base-6.1-10.20180224.el8.noarch 107/172 Verifying : openssl-libs-1:1.1.1k-12.el8_9.x86_64 108/172 Verifying : platform-python-3.6.8-56.el8_9.3.x86_64 109/172 Verifying : python3-libs-3.6.8-56.el8_9.3.x86_64 110/172 Verifying : redhat-release-8.9-0.1.el8.x86_64 111/172 Verifying : shadow-utils-2:4.6-19.el8.x86_64 112/172 Verifying : sqlite-libs-3.26.0-19.el8_9.x86_64 113/172 Verifying : systemd-libs-239-78.el8.x86_64 114/172 Verifying : tpm2-tss-2.3.2-5.el8.x86_64 115/172 Verifying : zlib-1.2.11-25.el8.x86_64 116/172 Verifying : libssh-0.9.6-13.el8_9.x86_64 117/172 Verifying : libssh-config-0.9.6-13.el8_9.noarch 118/172 Verifying : rpm-4.14.3-28.el8_9.x86_64 119/172 Verifying : rpm-build-libs-4.14.3-28.el8_9.x86_64 120/172 Verifying : rpm-libs-4.14.3-28.el8_9.x86_64 121/172 Verifying : tzdata-2024a-1.el8.noarch 122/172 Verifying : glibc-2.28-236.el8_9.12.x86_64 123/172 Verifying : glibc-all-langpacks-2.28-236.el8_9.12.x86_64 124/172 Verifying : glibc-common-2.28-236.el8_9.12.x86_64 125/172 Verifying : glibc-devel-2.28-236.el8_9.12.x86_64 126/172 Verifying : glibc-gconv-extra-2.28-236.el8_9.12.x86_64 127/172 Verifying : glibc-headers-2.28-236.el8_9.12.x86_64 128/172 Verifying : curl-7.61.1-33.el8_9.5.x86_64 129/172 Verifying : kernel-headers-4.18.0-513.24.1.el8_9.x86_64 130/172 Verifying : libblkid-2.32.1-44.el8_9.1.x86_64 131/172 Verifying : libcurl-7.61.1-33.el8_9.5.x86_64 132/172 Verifying : libfdisk-2.32.1-44.el8_9.1.x86_64 133/172 Verifying : libmount-2.32.1-44.el8_9.1.x86_64 134/172 Verifying : libsmartcols-2.32.1-44.el8_9.1.x86_64 135/172 Verifying : libuuid-2.32.1-44.el8_9.1.x86_64 136/172 Verifying : python3-pip-wheel-9.0.3-23.el8_9.1.noarch 137/172 Verifying : util-linux-2.32.1-44.el8_9.1.x86_64 138/172 Verifying : expat-2.2.5-11.el8_9.1.x86_64 139/172 Verifying : gnutls-3.6.16-8.el8_9.3.x86_64 140/172 Verifying : 
ghc-srpm-macros-1.4.2-7.el8.noarch 141/172 Verifying : ocaml-srpm-macros-5-4.el8.noarch 142/172 Verifying : openblas-srpm-macros-2-2.el8.noarch 143/172 Verifying : perl-srpm-macros-1-25.el8.noarch 144/172 Verifying : rust-srpm-macros-5-2.el8.noarch 145/172 Verifying : libatomic_ops-7.6.2-3.el8.x86_64 146/172 Verifying : gc-7.6.4-3.el8.x86_64 147/172 Verifying : guile-5:2.0.14-7.el8.x86_64 148/172 Verifying : isl-0.16.1-6.el8.x86_64 149/172 Verifying : libipt-1.6.1-8.el8.x86_64 150/172 Verifying : zstd-1.4.4-1.el8.x86_64 151/172 Verifying : libmpc-1.1.0-9.1.el8.x86_64 152/172 Verifying : efi-srpm-macros-3-3.el8.noarch 153/172 Verifying : go-srpm-macros-2-17.el8.noarch 154/172 Verifying : dwz-0.12-10.el8.x86_64 155/172 Verifying : qt5-srpm-macros-5.15.3-1.el8.noarch 156/172 Verifying : python-rpm-macros-3-45.el8.noarch 157/172 Verifying : python3-rpm-macros-3-45.el8.noarch 158/172 Verifying : redhat-rpm-config-131-1.el8.noarch 159/172 Verifying : python-srpm-macros-3-45.el8.noarch 160/172 Verifying : gcc-c++-8.5.0-20.el8.x86_64 161/172 Verifying : gcc-plugin-annobin-8.5.0-20.el8.x86_64 162/172 Verifying : annobin-11.13-2.el8.x86_64 163/172 Verifying : cpp-8.5.0-20.el8.x86_64 164/172 Verifying : gcc-8.5.0-20.el8.x86_64 165/172 Verifying : gdb-headless-8.2-20.el8.x86_64 166/172 Verifying : libstdc++-devel-8.5.0-20.el8.x86_64 167/172 Verifying : rpm-build-4.14.3-28.el8_9.x86_64 168/172 Verifying : ansible-srpm-macros-1-12.el8.noarch 169/172 Verifying : epel-rpm-macros-8-41.noarch 170/172 Verifying : fpc-srpm-macros-1.3-1.el8.noarch 171/172 Verifying : lua-srpm-macros-1-13.el8.noarch 172/172 Installed products updated. 
Installed: annobin-11.13-2.el8.x86_64 ansible-srpm-macros-1-12.el8.noarch audit-libs-3.0.7-5.el8.x86_64 basesystem-11-5.el8.noarch bash-4.4.20-4.el8_6.x86_64 binutils-2.30-123.el8.x86_64 brotli-1.0.6-3.el8.x86_64 bzip2-1.0.6-26.el8.x86_64 bzip2-libs-1.0.6-26.el8.x86_64 ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.noarch chkconfig-1.19.2-1.el8.x86_64 coreutils-8.30-15.el8.x86_64 coreutils-common-8.30-15.el8.x86_64 cpio-2.12-11.el8.x86_64 cpp-8.5.0-20.el8.x86_64 cracklib-2.9.6-15.el8.x86_64 cracklib-dicts-2.9.6-15.el8.x86_64 crypto-policies-20230731-1.git3177e06.el8.noarch curl-7.61.1-33.el8_9.5.x86_64 cyrus-sasl-lib-2.1.27-6.el8_5.x86_64 diffutils-3.6-6.el8.x86_64 dwz-0.12-10.el8.x86_64 efi-srpm-macros-3-3.el8.noarch elfutils-0.189-3.el8.x86_64 elfutils-default-yama-scope-0.189-3.el8.noarch elfutils-libelf-0.189-3.el8.x86_64 elfutils-libs-0.189-3.el8.x86_64 epel-rpm-macros-8-41.noarch expat-2.2.5-11.el8_9.1.x86_64 file-5.33-25.el8.x86_64 file-libs-5.33-25.el8.x86_64 filesystem-3.8-6.el8.x86_64 findutils-1:4.6.0-21.el8.x86_64 fpc-srpm-macros-1.3-1.el8.noarch gawk-4.2.1-4.el8.x86_64 gc-7.6.4-3.el8.x86_64 gcc-8.5.0-20.el8.x86_64 gcc-c++-8.5.0-20.el8.x86_64 gcc-plugin-annobin-8.5.0-20.el8.x86_64 gdb-headless-8.2-20.el8.x86_64 gdbm-1:1.18-2.el8.x86_64 gdbm-libs-1:1.18-2.el8.x86_64 ghc-srpm-macros-1.4.2-7.el8.noarch glib2-2.56.4-161.el8.x86_64 glibc-2.28-236.el8_9.12.x86_64 glibc-all-langpacks-2.28-236.el8_9.12.x86_64 glibc-common-2.28-236.el8_9.12.x86_64 glibc-devel-2.28-236.el8_9.12.x86_64 glibc-gconv-extra-2.28-236.el8_9.12.x86_64 glibc-headers-2.28-236.el8_9.12.x86_64 gmp-1:6.1.2-10.el8.x86_64 gnupg2-2.2.20-3.el8_6.x86_64 gnutls-3.6.16-8.el8_9.3.x86_64 go-srpm-macros-2-17.el8.noarch grep-3.1-6.el8.x86_64 guile-5:2.0.14-7.el8.x86_64 gzip-1.9-13.el8_5.x86_64 ima-evm-utils-1.3.2-12.el8.x86_64 info-6.5-7.el8.x86_64 isl-0.16.1-6.el8.x86_64 kernel-headers-4.18.0-513.24.1.el8_9.x86_64 keyutils-libs-1.5.10-9.el8.x86_64 krb5-libs-1.18.2-26.el8_9.x86_64 
libacl-2.2.53-1.el8.x86_64 libarchive-3.3.3-5.el8.x86_64 libassuan-2.5.1-3.el8.x86_64 libatomic_ops-7.6.2-3.el8.x86_64 libattr-2.4.48-3.el8.x86_64 libbabeltrace-1.5.4-4.el8.x86_64 libblkid-2.32.1-44.el8_9.1.x86_64 libcap-2.48-6.el8_9.x86_64 libcap-ng-0.7.11-1.el8.x86_64 libcom_err-1.45.6-5.el8.x86_64 libcurl-7.61.1-33.el8_9.5.x86_64 libdb-5.3.28-42.el8_4.x86_64 libdb-utils-5.3.28-42.el8_4.x86_64 libfdisk-2.32.1-44.el8_9.1.x86_64 libffi-3.1-24.el8.x86_64 libgcc-8.5.0-20.el8.x86_64 libgcrypt-1.8.5-7.el8_6.x86_64 libgomp-8.5.0-20.el8.x86_64 libgpg-error-1.31-1.el8.x86_64 libidn2-2.2.0-1.el8.x86_64 libipt-1.6.1-8.el8.x86_64 libksba-1.3.5-9.el8_7.x86_64 libmount-2.32.1-44.el8_9.1.x86_64 libmpc-1.1.0-9.1.el8.x86_64 libnghttp2-1.33.0-5.el8_9.x86_64 libnsl2-1.2.0-2.20180605git4a062cf.el8.x86_64 libpkgconf-1.4.2-1.el8.x86_64 libpsl-0.20.2-6.el8.x86_64 libpwquality-1.4.4-6.el8.x86_64 libselinux-2.9-8.el8.x86_64 libsemanage-2.9-9.el8_6.x86_64 libsepol-2.9-3.el8.x86_64 libsigsegv-2.11-5.el8.x86_64 libsmartcols-2.32.1-44.el8_9.1.x86_64 libssh-0.9.6-13.el8_9.x86_64 libssh-config-0.9.6-13.el8_9.noarch libstdc++-8.5.0-20.el8.x86_64 libstdc++-devel-8.5.0-20.el8.x86_64 libtasn1-4.13-4.el8_7.x86_64 libtirpc-1.1.4-8.el8.x86_64 libtool-ltdl-2.4.6-25.el8.x86_64 libunistring-0.9.9-3.el8.x86_64 libusbx-1.0.23-4.el8.x86_64 libutempter-1.1.6-14.el8.x86_64 libuuid-2.32.1-44.el8_9.1.x86_64 libverto-0.3.2-2.el8.x86_64 libxcrypt-4.1.1-6.el8.x86_64 libxcrypt-devel-4.1.1-6.el8.x86_64 libxml2-2.9.7-18.el8_9.x86_64 libzstd-1.4.4-1.el8.x86_64 lua-libs-5.3.4-12.el8.x86_64 lua-srpm-macros-1-13.el8.noarch lz4-libs-1.8.3-3.el8_4.x86_64 make-1:4.2.1-11.el8.x86_64 mpfr-3.1.6-1.el8.x86_64 ncurses-6.1-10.20180224.el8.x86_64 ncurses-base-6.1-10.20180224.el8.noarch ncurses-libs-6.1-10.20180224.el8.x86_64 nettle-3.4.1-7.el8.x86_64 npth-1.5-4.el8.x86_64 ocaml-srpm-macros-5-4.el8.noarch openblas-srpm-macros-2-2.el8.noarch openldap-2.4.46-18.el8.x86_64 openssl-libs-1:1.1.1k-12.el8_9.x86_64 
p11-kit-0.23.22-1.el8.x86_64 p11-kit-trust-0.23.22-1.el8.x86_64 pam-1.3.1-27.el8.x86_64 patch-2.7.6-11.el8.x86_64 pcre-8.42-6.el8.x86_64 pcre2-10.32-3.el8_6.x86_64 perl-srpm-macros-1-25.el8.noarch pkgconf-1.4.2-1.el8.x86_64 pkgconf-m4-1.4.2-1.el8.noarch pkgconf-pkg-config-1.4.2-1.el8.x86_64 platform-python-3.6.8-56.el8_9.3.x86_64 platform-python-setuptools-39.2.0-7.el8.noarch popt-1.18-1.el8.x86_64 publicsuffix-list-dafsa-20180723-1.el8.noarch python-rpm-macros-3-45.el8.noarch python-srpm-macros-3-45.el8.noarch python3-libs-3.6.8-56.el8_9.3.x86_64 python3-pip-wheel-9.0.3-23.el8_9.1.noarch python3-rpm-macros-3-45.el8.noarch python3-setuptools-wheel-39.2.0-7.el8.noarch qt5-srpm-macros-5.15.3-1.el8.noarch readline-7.0-10.el8.x86_64 redhat-release-8.9-0.1.el8.x86_64 redhat-rpm-config-131-1.el8.noarch rpm-4.14.3-28.el8_9.x86_64 rpm-build-4.14.3-28.el8_9.x86_64 rpm-build-libs-4.14.3-28.el8_9.x86_64 rpm-libs-4.14.3-28.el8_9.x86_64 rust-srpm-macros-5-2.el8.noarch sed-4.5-5.el8.x86_64 setup-2.12.2-9.el8.noarch shadow-utils-2:4.6-19.el8.x86_64 sqlite-libs-3.26.0-19.el8_9.x86_64 systemd-libs-239-78.el8.x86_64 tar-2:1.30-9.el8.x86_64 tpm2-tss-2.3.2-5.el8.x86_64 tzdata-2024a-1.el8.noarch unzip-6.0-46.el8.x86_64 util-linux-2.32.1-44.el8_9.1.x86_64 which-2.21-20.el8.x86_64 xz-5.2.4-4.el8_6.x86_64 xz-libs-5.2.4-4.el8_6.x86_64 zip-3.0-23.el8.x86_64 zlib-1.2.11-25.el8.x86_64 zstd-1.4.4-1.el8.x86_64 Complete! 
Finish: installing minimal buildroot with dnf Start: creating root cache Finish: creating root cache Finish: chroot init INFO: Installed packages: INFO: annobin-11.13-2.el8.x86_64 ansible-srpm-macros-1-12.el8.noarch audit-libs-3.0.7-5.el8.x86_64 basesystem-11-5.el8.noarch bash-4.4.20-4.el8_6.x86_64 binutils-2.30-123.el8.x86_64 brotli-1.0.6-3.el8.x86_64 bzip2-1.0.6-26.el8.x86_64 bzip2-libs-1.0.6-26.el8.x86_64 ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.noarch chkconfig-1.19.2-1.el8.x86_64 coreutils-8.30-15.el8.x86_64 coreutils-common-8.30-15.el8.x86_64 cpio-2.12-11.el8.x86_64 cpp-8.5.0-20.el8.x86_64 cracklib-2.9.6-15.el8.x86_64 cracklib-dicts-2.9.6-15.el8.x86_64 crypto-policies-20230731-1.git3177e06.el8.noarch curl-7.61.1-33.el8_9.5.x86_64 cyrus-sasl-lib-2.1.27-6.el8_5.x86_64 diffutils-3.6-6.el8.x86_64 dwz-0.12-10.el8.x86_64 efi-srpm-macros-3-3.el8.noarch elfutils-0.189-3.el8.x86_64 elfutils-default-yama-scope-0.189-3.el8.noarch elfutils-libelf-0.189-3.el8.x86_64 elfutils-libs-0.189-3.el8.x86_64 epel-rpm-macros-8-41.noarch expat-2.2.5-11.el8_9.1.x86_64 file-5.33-25.el8.x86_64 file-libs-5.33-25.el8.x86_64 filesystem-3.8-6.el8.x86_64 findutils-4.6.0-21.el8.x86_64 fpc-srpm-macros-1.3-1.el8.noarch gawk-4.2.1-4.el8.x86_64 gc-7.6.4-3.el8.x86_64 gcc-8.5.0-20.el8.x86_64 gcc-c++-8.5.0-20.el8.x86_64 gcc-plugin-annobin-8.5.0-20.el8.x86_64 gdb-headless-8.2-20.el8.x86_64 gdbm-1.18-2.el8.x86_64 gdbm-libs-1.18-2.el8.x86_64 ghc-srpm-macros-1.4.2-7.el8.noarch glib2-2.56.4-161.el8.x86_64 glibc-2.28-236.el8_9.12.x86_64 glibc-all-langpacks-2.28-236.el8_9.12.x86_64 glibc-common-2.28-236.el8_9.12.x86_64 glibc-devel-2.28-236.el8_9.12.x86_64 glibc-gconv-extra-2.28-236.el8_9.12.x86_64 glibc-headers-2.28-236.el8_9.12.x86_64 gmp-6.1.2-10.el8.x86_64 gnupg2-2.2.20-3.el8_6.x86_64 gnutls-3.6.16-8.el8_9.3.x86_64 go-srpm-macros-2-17.el8.noarch gpg-pubkey-2f86d6a1-5cf7cefb gpg-pubkey-2fa658e0-45700c69 gpg-pubkey-fd431d51-4ae0493b grep-3.1-6.el8.x86_64 guile-2.0.14-7.el8.x86_64 
gzip-1.9-13.el8_5.x86_64 ima-evm-utils-1.3.2-12.el8.x86_64 info-6.5-7.el8.x86_64 isl-0.16.1-6.el8.x86_64 kernel-headers-4.18.0-513.24.1.el8_9.x86_64 keyutils-libs-1.5.10-9.el8.x86_64 krb5-libs-1.18.2-26.el8_9.x86_64 libacl-2.2.53-1.el8.x86_64 libarchive-3.3.3-5.el8.x86_64 libassuan-2.5.1-3.el8.x86_64 libatomic_ops-7.6.2-3.el8.x86_64 libattr-2.4.48-3.el8.x86_64 libbabeltrace-1.5.4-4.el8.x86_64 libblkid-2.32.1-44.el8_9.1.x86_64 libcap-2.48-6.el8_9.x86_64 libcap-ng-0.7.11-1.el8.x86_64 libcom_err-1.45.6-5.el8.x86_64 libcurl-7.61.1-33.el8_9.5.x86_64 libdb-5.3.28-42.el8_4.x86_64 libdb-utils-5.3.28-42.el8_4.x86_64 libfdisk-2.32.1-44.el8_9.1.x86_64 libffi-3.1-24.el8.x86_64 libgcc-8.5.0-20.el8.x86_64 libgcrypt-1.8.5-7.el8_6.x86_64 libgomp-8.5.0-20.el8.x86_64 libgpg-error-1.31-1.el8.x86_64 libidn2-2.2.0-1.el8.x86_64 libipt-1.6.1-8.el8.x86_64 libksba-1.3.5-9.el8_7.x86_64 libmount-2.32.1-44.el8_9.1.x86_64 libmpc-1.1.0-9.1.el8.x86_64 libnghttp2-1.33.0-5.el8_9.x86_64 libnsl2-1.2.0-2.20180605git4a062cf.el8.x86_64 libpkgconf-1.4.2-1.el8.x86_64 libpsl-0.20.2-6.el8.x86_64 libpwquality-1.4.4-6.el8.x86_64 libselinux-2.9-8.el8.x86_64 libsemanage-2.9-9.el8_6.x86_64 libsepol-2.9-3.el8.x86_64 libsigsegv-2.11-5.el8.x86_64 libsmartcols-2.32.1-44.el8_9.1.x86_64 libssh-0.9.6-13.el8_9.x86_64 libssh-config-0.9.6-13.el8_9.noarch libstdc++-8.5.0-20.el8.x86_64 libstdc++-devel-8.5.0-20.el8.x86_64 libtasn1-4.13-4.el8_7.x86_64 libtirpc-1.1.4-8.el8.x86_64 libtool-ltdl-2.4.6-25.el8.x86_64 libunistring-0.9.9-3.el8.x86_64 libusbx-1.0.23-4.el8.x86_64 libutempter-1.1.6-14.el8.x86_64 libuuid-2.32.1-44.el8_9.1.x86_64 libverto-0.3.2-2.el8.x86_64 libxcrypt-4.1.1-6.el8.x86_64 libxcrypt-devel-4.1.1-6.el8.x86_64 libxml2-2.9.7-18.el8_9.x86_64 libzstd-1.4.4-1.el8.x86_64 lua-libs-5.3.4-12.el8.x86_64 lua-srpm-macros-1-13.el8.noarch lz4-libs-1.8.3-3.el8_4.x86_64 make-4.2.1-11.el8.x86_64 mpfr-3.1.6-1.el8.x86_64 ncurses-6.1-10.20180224.el8.x86_64 ncurses-base-6.1-10.20180224.el8.noarch 
ncurses-libs-6.1-10.20180224.el8.x86_64 nettle-3.4.1-7.el8.x86_64 npth-1.5-4.el8.x86_64 ocaml-srpm-macros-5-4.el8.noarch openblas-srpm-macros-2-2.el8.noarch openldap-2.4.46-18.el8.x86_64 openssl-libs-1.1.1k-12.el8_9.x86_64 p11-kit-0.23.22-1.el8.x86_64 p11-kit-trust-0.23.22-1.el8.x86_64 pam-1.3.1-27.el8.x86_64 patch-2.7.6-11.el8.x86_64 pcre-8.42-6.el8.x86_64 pcre2-10.32-3.el8_6.x86_64 perl-srpm-macros-1-25.el8.noarch pkgconf-1.4.2-1.el8.x86_64 pkgconf-m4-1.4.2-1.el8.noarch pkgconf-pkg-config-1.4.2-1.el8.x86_64 platform-python-3.6.8-56.el8_9.3.x86_64 platform-python-setuptools-39.2.0-7.el8.noarch popt-1.18-1.el8.x86_64 publicsuffix-list-dafsa-20180723-1.el8.noarch python-rpm-macros-3-45.el8.noarch python-srpm-macros-3-45.el8.noarch python3-libs-3.6.8-56.el8_9.3.x86_64 python3-pip-wheel-9.0.3-23.el8_9.1.noarch python3-rpm-macros-3-45.el8.noarch python3-setuptools-wheel-39.2.0-7.el8.noarch qt5-srpm-macros-5.15.3-1.el8.noarch readline-7.0-10.el8.x86_64 redhat-release-8.9-0.1.el8.x86_64 redhat-rpm-config-131-1.el8.noarch rpm-4.14.3-28.el8_9.x86_64 rpm-build-4.14.3-28.el8_9.x86_64 rpm-build-libs-4.14.3-28.el8_9.x86_64 rpm-libs-4.14.3-28.el8_9.x86_64 rust-srpm-macros-5-2.el8.noarch sed-4.5-5.el8.x86_64 setup-2.12.2-9.el8.noarch shadow-utils-4.6-19.el8.x86_64 sqlite-libs-3.26.0-19.el8_9.x86_64 systemd-libs-239-78.el8.x86_64 tar-1.30-9.el8.x86_64 tpm2-tss-2.3.2-5.el8.x86_64 tzdata-2024a-1.el8.noarch unzip-6.0-46.el8.x86_64 util-linux-2.32.1-44.el8_9.1.x86_64 which-2.21-20.el8.x86_64 xz-5.2.4-4.el8_6.x86_64 xz-libs-5.2.4-4.el8_6.x86_64 zip-3.0-23.el8.x86_64 zlib-1.2.11-25.el8.x86_64 zstd-1.4.4-1.el8.x86_64 Start: buildsrpm Start: rpmbuild -bs sh: -c: line 0: unexpected EOF while looking for matching `"' sh: -c: line 1: syntax error: unexpected end of file Building target platforms: x86_64 Building for target x86_64 Wrote: /builddir/build/SRPMS/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm Finish: rpmbuild -bs cp: preserving permissions for 
‘/var/lib/copr-rpmbuild/results/chroot_scan/var/lib/mock/rhel+epel-8-x86_64-1713469181.334935/root/var/log’: No such file or directory
INFO: chroot_scan: 3 files copied to /var/lib/copr-rpmbuild/results/chroot_scan
INFO: /var/lib/mock/rhel+epel-8-x86_64-1713469181.334935/root/var/log/dnf.rpm.log
/var/lib/mock/rhel+epel-8-x86_64-1713469181.334935/root/var/log/dnf.librepo.log
/var/lib/mock/rhel+epel-8-x86_64-1713469181.334935/root/var/log/dnf.log
Finish: buildsrpm
INFO: Done(/var/lib/copr-rpmbuild/workspace/workdir-hd8xdoow/cutlass/cutlass.spec) Config(child) 1 minutes 16 seconds
INFO: Results and/or logs in: /var/lib/copr-rpmbuild/results
INFO: Cleaning up build root ('cleanup_on_success=True')
Start: clean chroot
INFO: unmounting tmpfs.
Finish: clean chroot
INFO: Start(/var/lib/copr-rpmbuild/results/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm) Config(rhel+epel-8-x86_64)
Start(bootstrap): chroot init
INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-x86_64-bootstrap-1713469181.334935/root.
INFO: reusing tmpfs at /var/lib/mock/rhel+epel-8-x86_64-bootstrap-1713469181.334935/root.
INFO: calling preinit hooks
INFO: enabled root cache
INFO: enabled package manager cache
Start(bootstrap): cleaning package manager metadata
Finish(bootstrap): cleaning package manager metadata
Finish(bootstrap): chroot init
Start: chroot init
INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-x86_64-1713469181.334935/root.
INFO: calling preinit hooks
INFO: enabled root cache
Start: unpacking root cache
Finish: unpacking root cache
INFO: enabled package manager cache
Start: cleaning package manager metadata
Finish: cleaning package manager metadata
INFO: enabled HW Info plugin
INFO: Buildroot is handled by package management downloaded with a bootstrap image:
rpm-4.14.3-28.el8_9.x86_64
python3-dnf-4.7.0-19.el8.noarch
python3-dnf-plugins-core-4.0.21-23.el8.noarch
yum-4.7.0-19.el8.noarch
Finish: chroot init
Start: build phase for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm
Start: build setup for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm
sh: -c: line 0: unexpected EOF while looking for matching `"'
sh: -c: line 1: syntax error: unexpected end of file
Building target platforms: x86_64
Building for target x86_64
Wrote: /builddir/build/SRPMS/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm
No matches found for the following disable plugin patterns: local, spacewalk, versionlock
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered with an entitlement server. You can use subscription-manager to register.
Copr repository 90 kB/s | 1.8 kB 00:00
Additional repo copr_rezso_CUDA 126 kB/s | 1.8 kB 00:00
Additional repo http_developer_download_nvidia_ 913 kB/s | 3.5 kB 00:00
Additional repo http_developer_download_nvidia_ 1.0 MB/s | 3.5 kB 00:00
Additional repo http_developer_download_nvidia_ 988 kB/s | 3.5 kB 00:00
Red Hat Enterprise Linux - BaseOS 32 kB/s | 4.1 kB 00:00
Red Hat Enterprise Linux - AppStream 52 kB/s | 4.5 kB 00:00
Red Hat Enterprise Linux - CodeReady Linux Buil 50 kB/s | 4.5 kB 00:00
Extra Packages for Enterprise Linux 8 - x86_64 94 kB/s | 26 kB 00:00
Modular dependency problems:
Problem 1: nothing provides requested module(nvidia-driver:latest-dkms:20240416084055)
Problem 2: nothing provides requested module(nvidia-driver:latest-dkms:20240416084208)
Package gcc-c++-8.5.0-20.el8.x86_64 is already installed.
Dependencies resolved.
================================================================================================================================================================= Package Arch Version Repository Size ================================================================================================================================================================= Installing: cmake x86_64 3.26.5-1.el8_9 rhel-appstream 14 M cuda-cudart-devel-12-4 x86_64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 2.0 M cuda-driver-devel-12-4 x86_64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 42 k cuda-nvcc-12-4 x86_64 12.4.131-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 69 M cuda-nvml-devel-12-4 x86_64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 219 k cuda-nvrtc-devel-12-4 x86_64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 27 M cuda-nvtx-12-4 x86_64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 88 k doxygen x86_64 1:1.8.14-12.el8 codeready-builder 3.9 M git x86_64 2.39.3-1.el8_8 rhel-appstream 104 k graphviz x86_64 2.40.1-44.el8 rhel-appstream 1.8 M libcublas-devel-12-4 x86_64 12.4.5.8-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 400 M libcudnn8 x86_64 8.9.7.29-2.cuda12.3 copr_rezso_CUDA 467 M libcudnn8-devel x86_64 8.9.7.29-2.cuda12.3 copr_rezso_CUDA 35 k libcurand-devel-12-4 x86_64 10.3.5.147-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 53 M python3-setuptools noarch 39.2.0-7.el8 rhel-baseos 163 k python36-devel x86_64 3.6.8-38.module+el8.9.0+20976+d3c38525 rhel-appstream 17 k Installing dependencies: adobe-mappings-cmap noarch 20171205-3.el8 rhel-appstream 2.1 M adobe-mappings-cmap-deprecated noarch 20171205-3.el8 rhel-appstream 119 k adobe-mappings-pdf noarch 20180407-1.el8 rhel-appstream 707 k atk x86_64 2.28.1-1.el8 rhel-appstream 272 k 
avahi-libs x86_64 0.7-21.el8_9.1 rhel-baseos 62 k cairo x86_64 1.15.12-6.el8 rhel-appstream 719 k cmake-data noarch 3.26.5-1.el8_9 rhel-appstream 1.9 M cmake-filesystem x86_64 3.26.5-1.el8_9 rhel-appstream 45 k cmake-rpm-macros noarch 3.26.5-1.el8_9 rhel-appstream 44 k cuda-cccl-12-4 x86_64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 1.9 M cuda-crt-12-4 x86_64 12.4.131-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 112 k cuda-cudart-12-4 x86_64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 224 k cuda-nvrtc-12-4 x86_64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 23 M cuda-nvvm-12-4 x86_64 12.4.131-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 26 M cuda-toolkit-12-4-config-common noarch 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 7.7 k cuda-toolkit-12-config-common noarch 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 7.9 k cuda-toolkit-config-common noarch 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 7.9 k cups-libs x86_64 1:2.2.6-54.el8_9 rhel-baseos 435 k dbus-libs x86_64 1:1.12.8-26.el8 rhel-baseos 185 k emacs-filesystem noarch 1:26.1-11.el8 rhel-baseos 70 k fontconfig x86_64 2.13.1-4.el8 rhel-baseos 274 k fontpackages-filesystem noarch 1.44-22.el8 rhel-baseos 16 k freetype x86_64 2.9.1-9.el8 rhel-baseos 394 k fribidi x86_64 1.0.4-9.el8 rhel-appstream 89 k gd x86_64 2.2.5-7.el8 rhel-appstream 144 k gdk-pixbuf2 x86_64 2.36.12-5.el8 rhel-baseos 467 k gdk-pixbuf2-modules x86_64 2.36.12-5.el8 rhel-appstream 109 k git-core x86_64 2.39.3-1.el8_8 rhel-appstream 11 M git-core-doc noarch 2.39.3-1.el8_8 rhel-appstream 3.0 M google-droid-sans-fonts noarch 20120715-13.el8 rhel-appstream 2.5 M graphite2 x86_64 1.3.10-10.el8 rhel-appstream 122 k groff-base x86_64 1.22.3-18.el8 rhel-baseos 1.0 M gtk-update-icon-cache x86_64 3.22.30-11.el8 
rhel-appstream 33 k gtk2 x86_64 2.24.32-5.el8 rhel-appstream 3.4 M harfbuzz x86_64 1.7.5-3.el8 rhel-appstream 294 k hicolor-icon-theme noarch 0.17-2.el8 rhel-appstream 48 k jasper-libs x86_64 2.0.14-5.el8 rhel-appstream 167 k jbig2dec-libs x86_64 0.16-1.el8 rhel-appstream 72 k jbigkit-libs x86_64 2.1-14.el8 rhel-appstream 55 k lcms2 x86_64 2.9-2.el8 rhel-appstream 165 k less x86_64 530-2.el8_9 rhel-baseos 164 k libICE x86_64 1.0.9-15.el8 rhel-appstream 74 k libSM x86_64 1.2.3-1.el8 rhel-appstream 48 k libX11 x86_64 1.6.8-6.el8 rhel-appstream 611 k libX11-common noarch 1.6.8-6.el8 rhel-appstream 158 k libXau x86_64 1.0.9-3.el8 rhel-appstream 37 k libXaw x86_64 1.0.13-10.el8 rhel-appstream 194 k libXcomposite x86_64 0.4.4-14.el8 rhel-appstream 29 k libXcursor x86_64 1.1.15-3.el8 rhel-appstream 36 k libXdamage x86_64 1.1.4-14.el8 rhel-appstream 27 k libXext x86_64 1.3.4-1.el8 rhel-appstream 45 k libXfixes x86_64 5.0.3-7.el8 rhel-appstream 25 k libXft x86_64 2.3.3-1.el8 rhel-appstream 67 k libXi x86_64 1.7.10-1.el8 rhel-appstream 49 k libXinerama x86_64 1.1.4-1.el8 rhel-appstream 16 k libXmu x86_64 1.1.3-1.el8 rhel-appstream 75 k libXpm x86_64 3.5.12-9.el8_7 rhel-appstream 58 k libXrandr x86_64 1.5.2-1.el8 rhel-appstream 34 k libXrender x86_64 0.9.10-7.el8 rhel-appstream 33 k libXt x86_64 1.1.5-12.el8 rhel-appstream 185 k libXxf86misc x86_64 1.0.4-1.el8 rhel-appstream 23 k libXxf86vm x86_64 1.1.4-9.el8 rhel-appstream 19 k libcroco x86_64 0.6.12-4.el8_2.1 rhel-baseos 113 k libcublas-12-4 x86_64 12.4.5.8-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 346 M libcurand-12-4 x86_64 10.3.5.147-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 53 M libdatrie x86_64 0.2.9-7.el8 rhel-appstream 33 k libedit x86_64 3.1-23.20170329cvs.el8 rhel-baseos 102 k libfontenc x86_64 1.1.3-8.el8 rhel-appstream 37 k libgs x86_64 9.27-11.el8 rhel-appstream 3.1 M libidn x86_64 1.34-5.el8 rhel-appstream 239 k libijs x86_64 0.35-5.el8 rhel-appstream 30 k 
libjpeg-turbo x86_64 1.5.3-12.el8 rhel-appstream 157 k libmcpp x86_64 2.7.2-20.el8 rhel-appstream 81 k libpaper x86_64 1.1.24-22.el8 rhel-appstream 45 k libpng x86_64 2:1.6.34-5.el8 rhel-baseos 126 k librsvg2 x86_64 2.42.7-5.el8 rhel-appstream 575 k libthai x86_64 0.1.27-2.el8 rhel-appstream 203 k libtiff x86_64 4.0.9-29.el8_8 rhel-appstream 189 k libuv x86_64 1:1.41.1-1.el8_4 rhel-appstream 156 k libwebp x86_64 1.0.0-9.el8_9.1 rhel-appstream 274 k libxcb x86_64 1.13.1-1.el8 rhel-appstream 229 k mcpp x86_64 2.7.2-20.el8 rhel-appstream 31 k openjpeg2 x86_64 2.4.0-5.el8 rhel-appstream 165 k openssh x86_64 8.0p1-19.el8_9.2 rhel-baseos 525 k openssh-clients x86_64 8.0p1-19.el8_9.2 rhel-baseos 645 k openssl x86_64 1:1.1.1k-12.el8_9 rhel-baseos 711 k pango x86_64 1.42.4-8.el8 rhel-appstream 297 k perl-Carp noarch 1.42-396.el8 rhel-baseos 30 k perl-Data-Dumper x86_64 2.167-399.el8 rhel-baseos 58 k perl-Digest noarch 1.17-395.el8 rhel-baseos 27 k perl-Digest-MD5 x86_64 2.55-396.el8 rhel-baseos 37 k perl-Encode x86_64 4:2.97-3.el8 rhel-baseos 1.5 M perl-Errno x86_64 1.28-422.el8 rhel-baseos 77 k perl-Error noarch 1:0.17025-2.el8 rhel-appstream 46 k perl-Exporter noarch 5.72-396.el8 rhel-baseos 34 k perl-File-Path noarch 2.15-2.el8 rhel-baseos 38 k perl-File-Temp noarch 0.230.600-1.el8 rhel-baseos 63 k perl-Getopt-Long noarch 1:2.50-4.el8 rhel-baseos 63 k perl-Git noarch 2.39.3-1.el8_8 rhel-appstream 79 k perl-HTTP-Tiny noarch 0.074-2.el8_9.1 rhel-baseos 59 k perl-IO x86_64 1.38-422.el8 rhel-baseos 142 k perl-IO-Socket-IP noarch 0.39-5.el8 rhel-baseos 47 k perl-IO-Socket-SSL noarch 2.066-4.module+el8.3.0+6446+594cad75 rhel-appstream 298 k perl-MIME-Base64 x86_64 3.15-396.el8 rhel-baseos 31 k perl-Mozilla-CA noarch 20160104-7.module+el8.3.0+6498+9eecfe51 rhel-appstream 15 k perl-Net-SSLeay x86_64 1.88-2.module+el8.6.0+13392+f0897f98 rhel-appstream 379 k perl-PathTools x86_64 3.74-1.el8 rhel-baseos 90 k perl-Pod-Escapes noarch 1:1.07-395.el8 rhel-baseos 20 k perl-Pod-Perldoc 
noarch 3.28-396.el8 rhel-baseos 88 k perl-Pod-Simple noarch 1:3.35-395.el8 rhel-baseos 213 k perl-Pod-Usage noarch 4:1.69-395.el8 rhel-baseos 34 k perl-Scalar-List-Utils x86_64 3:1.49-2.el8 rhel-baseos 68 k perl-Socket x86_64 4:2.027-3.el8 rhel-baseos 59 k perl-Storable x86_64 1:3.11-3.el8 rhel-baseos 98 k perl-Term-ANSIColor noarch 4.06-396.el8 rhel-baseos 46 k perl-Term-Cap noarch 1.17-395.el8 rhel-baseos 23 k perl-TermReadKey x86_64 2.37-7.el8 rhel-appstream 40 k perl-Text-ParseWords noarch 3.30-395.el8 rhel-baseos 18 k perl-Text-Tabs+Wrap noarch 2013.0523-395.el8 rhel-baseos 24 k perl-Time-Local noarch 1:1.280-1.el8 rhel-baseos 34 k perl-URI noarch 1.73-3.el8 rhel-baseos 116 k perl-Unicode-Normalize x86_64 1.25-396.el8 rhel-baseos 82 k perl-constant noarch 1.33-396.el8 rhel-baseos 25 k perl-interpreter x86_64 4:5.26.3-422.el8 rhel-baseos 6.3 M perl-libnet noarch 3.11-3.el8 rhel-baseos 121 k perl-libs x86_64 4:5.26.3-422.el8 rhel-baseos 1.6 M perl-macros x86_64 4:5.26.3-422.el8 rhel-baseos 73 k perl-parent noarch 1:0.237-1.el8 rhel-baseos 20 k perl-podlators noarch 4.11-1.el8 rhel-baseos 118 k perl-threads x86_64 1:2.21-2.el8 rhel-baseos 61 k perl-threads-shared x86_64 1.58-2.el8 rhel-baseos 48 k pixman x86_64 0.38.4-3.el8_9 rhel-appstream 258 k platform-python-devel x86_64 3.6.8-56.el8_9.3 rhel-appstream 241 k platform-python-pip noarch 9.0.3-23.el8_9.1 rhel-baseos 1.6 M python3-pip noarch 9.0.3-23.el8_9.1 rhel-appstream 20 k python3-rpm-generators noarch 5-8.el8 rhel-appstream 25 k python36 x86_64 3.6.8-38.module+el8.9.0+20976+d3c38525 rhel-appstream 19 k python36-rpm-macros noarch 3.6.8-38.module+el8.9.0+20976+d3c38525 rhel-appstream 16 k shared-mime-info x86_64 1.9-3.el8 rhel-baseos 329 k urw-base35-bookman-fonts noarch 20170801-10.el8 rhel-appstream 857 k urw-base35-c059-fonts noarch 20170801-10.el8 rhel-appstream 884 k urw-base35-d050000l-fonts noarch 20170801-10.el8 rhel-appstream 79 k urw-base35-fonts noarch 20170801-10.el8 rhel-appstream 12 k 
urw-base35-fonts-common noarch 20170801-10.el8 rhel-appstream 23 k urw-base35-gothic-fonts noarch 20170801-10.el8 rhel-appstream 654 k urw-base35-nimbus-mono-ps-fonts noarch 20170801-10.el8 rhel-appstream 801 k urw-base35-nimbus-roman-fonts noarch 20170801-10.el8 rhel-appstream 865 k urw-base35-nimbus-sans-fonts noarch 20170801-10.el8 rhel-appstream 1.3 M urw-base35-p052-fonts noarch 20170801-10.el8 rhel-appstream 982 k urw-base35-standard-symbols-ps-fonts noarch 20170801-10.el8 rhel-appstream 44 k urw-base35-z003-fonts noarch 20170801-10.el8 rhel-appstream 279 k vim-filesystem noarch 2:8.0.1763-19.el8_6.4 rhel-appstream 50 k xorg-x11-font-utils x86_64 1:7.5-41.el8 rhel-appstream 104 k xorg-x11-fonts-ISO8859-1-100dpi noarch 7.5-19.el8 rhel-appstream 1.1 M xorg-x11-server-utils x86_64 7.7-27.el8 rhel-appstream 197 k Enabling module streams: perl 5.26 perl-IO-Socket-SSL 2.066 perl-libwww-perl 6.34 python36 3.6 Transaction Summary ================================================================================================================================================================= Install 171 Packages Total download size: 1.5 G Installed size: 3.2 G Downloading Packages: (1/171): cuda-cccl-12-4-12.4.127-1.x86_64.rpm 98 MB/s | 1.9 MB 00:00 (2/171): libcudnn8-devel-8.9.7.29-2.cuda12.3.x8 1.6 MB/s | 35 kB 00:00 (3/171): cuda-crt-12-4-12.4.131-1.x86_64.rpm 41 MB/s | 112 kB 00:00 (4/171): cuda-cudart-12-4-12.4.127-1.x86_64.rpm 44 MB/s | 224 kB 00:00 (5/171): cuda-driver-devel-12-4-12.4.127-1.x86_ 20 MB/s | 42 kB 00:00 (6/171): cuda-cudart-devel-12-4-12.4.127-1.x86_ 178 MB/s | 2.0 MB 00:00 (7/171): cuda-nvml-devel-12-4-12.4.127-1.x86_64 53 MB/s | 219 kB 00:00 (8/171): cuda-nvrtc-12-4-12.4.127-1.x86_64.rpm 174 MB/s | 23 MB 00:00 (9/171): cuda-nvrtc-devel-12-4-12.4.127-1.x86_6 161 MB/s | 27 MB 00:00 (10/171): cuda-nvtx-12-4-12.4.127-1.x86_64.rpm 27 MB/s | 88 kB 00:00 (11/171): cuda-nvcc-12-4-12.4.131-1.x86_64.rpm 150 MB/s | 69 MB 00:00 (12/171): 
cuda-toolkit-12-4-config-common-12.4. 2.4 MB/s | 7.7 kB 00:00 (13/171): cuda-toolkit-12-config-common-12.4.12 3.9 MB/s | 7.9 kB 00:00 (14/171): cuda-toolkit-config-common-12.4.127-1 4.3 MB/s | 7.9 kB 00:00 (15/171): cuda-nvvm-12-4-12.4.131-1.x86_64.rpm 118 MB/s | 26 MB 00:00 (16/171): libcublas-12-4-12.4.5.8-1.x86_64.rpm 167 MB/s | 346 MB 00:02 (17/171): libcurand-12-4-10.3.5.147-1.x86_64.rp 164 MB/s | 53 MB 00:00 (18/171): libcublas-devel-12-4-12.4.5.8-1.x86_6 146 MB/s | 400 MB 00:02 (19/171): libcudnn8-8.9.7.29-2.cuda12.3.x86_64. 123 MB/s | 467 MB 00:03 (20/171): libcurand-devel-12-4-10.3.5.147-1.x86 51 MB/s | 53 MB 00:01 (21/171): libedit-3.1-23.20170329cvs.el8.x86_64 691 kB/s | 102 kB 00:00 (22/171): groff-base-1.22.3-18.el8.x86_64.rpm 1.6 MB/s | 1.0 MB 00:00 (23/171): perl-Data-Dumper-2.167-399.el8.x86_64 618 kB/s | 58 kB 00:00 (24/171): libpng-1.6.34-5.el8.x86_64.rpm 969 kB/s | 126 kB 00:00 (25/171): perl-Encode-2.97-3.el8.x86_64.rpm 9.4 MB/s | 1.5 MB 00:00 (26/171): perl-Scalar-List-Utils-1.49-2.el8.x86 781 kB/s | 68 kB 00:00 (27/171): perl-MIME-Base64-3.15-396.el8.x86_64. 179 kB/s | 31 kB 00:00 (28/171): perl-PathTools-3.74-1.el8.x86_64.rpm 534 kB/s | 90 kB 00:00 (29/171): perl-Unicode-Normalize-1.25-396.el8.x 993 kB/s | 82 kB 00:00 (30/171): shared-mime-info-1.9-3.el8.x86_64.rpm 4.7 MB/s | 329 kB 00:00 (31/171): fontpackages-filesystem-1.44-22.el8.n 275 kB/s | 16 kB 00:00 (32/171): perl-Carp-1.42-396.el8.noarch.rpm 573 kB/s | 30 kB 00:00 (33/171): perl-threads-shared-1.58-2.el8.x86_64 304 kB/s | 48 kB 00:00 (34/171): perl-Exporter-5.72-396.el8.noarch.rpm 535 kB/s | 34 kB 00:00 (35/171): perl-File-Path-2.15-2.el8.noarch.rpm 589 kB/s | 38 kB 00:00 (36/171): perl-File-Temp-0.230.600-1.el8.noarch 769 kB/s | 63 kB 00:00 (37/171): perl-Getopt-Long-2.50-4.el8.noarch.rp 847 kB/s | 63 kB 00:00 (38/171): perl-Pod-Escapes-1.07-395.el8.noarch. 308 kB/s | 20 kB 00:00 (39/171): perl-Pod-Perldoc-3.28-396.el8.noarch. 
1.4 MB/s | 88 kB 00:00 (40/171): perl-Pod-Usage-1.69-395.el8.noarch.rp 622 kB/s | 34 kB 00:00 (41/171): perl-Storable-3.11-3.el8.x86_64.rpm 1.5 MB/s | 98 kB 00:00 (42/171): perl-Pod-Simple-3.35-395.el8.noarch.r 1.8 MB/s | 213 kB 00:00 (43/171): perl-Term-ANSIColor-4.06-396.el8.noar 762 kB/s | 46 kB 00:00 (44/171): perl-Term-Cap-1.17-395.el8.noarch.rpm 416 kB/s | 23 kB 00:00 (45/171): perl-Text-ParseWords-3.30-395.el8.noa 310 kB/s | 18 kB 00:00 (46/171): perl-Text-Tabs+Wrap-2013.0523-395.el8 405 kB/s | 24 kB 00:00 (47/171): perl-constant-1.33-396.el8.noarch.rpm 445 kB/s | 25 kB 00:00 (48/171): perl-parent-0.237-1.el8.noarch.rpm 370 kB/s | 20 kB 00:00 (49/171): perl-Time-Local-1.280-1.el8.noarch.rp 311 kB/s | 34 kB 00:00 (50/171): perl-podlators-4.11-1.el8.noarch.rpm 2.0 MB/s | 118 kB 00:00 (51/171): perl-threads-2.21-2.el8.x86_64.rpm 1.0 MB/s | 61 kB 00:00 (52/171): gdk-pixbuf2-2.36.12-5.el8.x86_64.rpm 6.3 MB/s | 467 kB 00:00 (53/171): perl-Socket-2.027-3.el8.x86_64.rpm 915 kB/s | 59 kB 00:00 (54/171): libcroco-0.6.12-4.el8_2.1.x86_64.rpm 1.3 MB/s | 113 kB 00:00 (55/171): freetype-2.9.1-9.el8.x86_64.rpm 5.8 MB/s | 394 kB 00:00 (56/171): perl-Errno-1.28-422.el8.x86_64.rpm 1.4 MB/s | 77 kB 00:00 (57/171): fontconfig-2.13.1-4.el8.x86_64.rpm 1.9 MB/s | 274 kB 00:00 (58/171): perl-interpreter-5.26.3-422.el8.x86_6 65 MB/s | 6.3 MB 00:00 (59/171): perl-IO-1.38-422.el8.x86_64.rpm 1.1 MB/s | 142 kB 00:00 (60/171): perl-libs-5.26.3-422.el8.x86_64.rpm 16 MB/s | 1.6 MB 00:00 (61/171): perl-macros-5.26.3-422.el8.x86_64.rpm 1.2 MB/s | 73 kB 00:00 (62/171): dbus-libs-1.12.8-26.el8.x86_64.rpm 2.2 MB/s | 185 kB 00:00 (63/171): emacs-filesystem-26.1-11.el8.noarch.r 1.1 MB/s | 70 kB 00:00 (64/171): perl-Digest-MD5-2.55-396.el8.x86_64.r 644 kB/s | 37 kB 00:00 (65/171): python3-setuptools-39.2.0-7.el8.noarc 795 kB/s | 163 kB 00:00 (66/171): perl-libnet-3.11-3.el8.noarch.rpm 2.0 MB/s | 121 kB 00:00 (67/171): perl-URI-1.73-3.el8.noarch.rpm 1.0 MB/s | 116 kB 00:00 (68/171): 
avahi-libs-0.7-21.el8_9.1.x86_64.rpm 795 kB/s | 62 kB 00:00 (69/171): cups-libs-2.2.6-54.el8_9.x86_64.rpm 7.3 MB/s | 435 kB 00:00 (70/171): openssl-1.1.1k-12.el8_9.x86_64.rpm 13 MB/s | 711 kB 00:00 (71/171): perl-Digest-1.17-395.el8.noarch.rpm 452 kB/s | 27 kB 00:00 (72/171): openssh-8.0p1-19.el8_9.2.x86_64.rpm 7.1 MB/s | 525 kB 00:00 (73/171): perl-IO-Socket-IP-0.39-5.el8.noarch.r 581 kB/s | 47 kB 00:00 (74/171): openssh-clients-8.0p1-19.el8_9.2.x86_ 8.4 MB/s | 645 kB 00:00 (75/171): perl-HTTP-Tiny-0.074-2.el8_9.1.noarch 1.1 MB/s | 59 kB 00:00 (76/171): less-530-2.el8_9.x86_64.rpm 2.6 MB/s | 164 kB 00:00 (77/171): urw-base35-fonts-20170801-10.el8.noar 169 kB/s | 12 kB 00:00 (78/171): google-droid-sans-fonts-20120715-13.e 27 MB/s | 2.5 MB 00:00 (79/171): platform-python-pip-9.0.3-23.el8_9.1. 13 MB/s | 1.6 MB 00:00 (80/171): urw-base35-gothic-fonts-20170801-10.e 10 MB/s | 654 kB 00:00 (81/171): urw-base35-p052-fonts-20170801-10.el8 14 MB/s | 982 kB 00:00 (82/171): adobe-mappings-cmap-20171205-3.el8.no 17 MB/s | 2.1 MB 00:00 (83/171): adobe-mappings-cmap-deprecated-201712 1.1 MB/s | 119 kB 00:00 (84/171): xorg-x11-fonts-ISO8859-1-100dpi-7.5-1 6.7 MB/s | 1.1 MB 00:00 (85/171): hicolor-icon-theme-0.17-2.el8.noarch. 833 kB/s | 48 kB 00:00 (86/171): lcms2-2.9-2.el8.x86_64.rpm 1.4 MB/s | 165 kB 00:00 (87/171): adobe-mappings-pdf-20180407-1.el8.noa 4.3 MB/s | 707 kB 00:00 (88/171): perl-Error-0.17025-2.el8.noarch.rpm 468 kB/s | 46 kB 00:00 (89/171): perl-TermReadKey-2.37-7.el8.x86_64.rp 734 kB/s | 40 kB 00:00 (90/171): urw-base35-c059-fonts-20170801-10.el8 15 MB/s | 884 kB 00:00 (91/171): urw-base35-d050000l-fonts-20170801-10 1.4 MB/s | 79 kB 00:00 (92/171): urw-base35-fonts-common-20170801-10.e 292 kB/s | 23 kB 00:00 (93/171): urw-base35-bookman-fonts-20170801-10. 
5.8 MB/s | 857 kB 00:00 (94/171): urw-base35-nimbus-sans-fonts-20170801 21 MB/s | 1.3 MB 00:00 (95/171): urw-base35-nimbus-roman-fonts-2017080 9.8 MB/s | 865 kB 00:00 (96/171): urw-base35-nimbus-mono-ps-fonts-20170 5.1 MB/s | 801 kB 00:00 (97/171): urw-base35-standard-symbols-ps-fonts- 818 kB/s | 44 kB 00:00 (98/171): urw-base35-z003-fonts-20170801-10.el8 4.0 MB/s | 279 kB 00:00 (99/171): graphite2-1.3.10-10.el8.x86_64.rpm 1.6 MB/s | 122 kB 00:00 (100/171): jbigkit-libs-2.1-14.el8.x86_64.rpm 776 kB/s | 55 kB 00:00 (101/171): libXcursor-1.1.15-3.el8.x86_64.rpm 639 kB/s | 36 kB 00:00 (102/171): libXxf86misc-1.0.4-1.el8.x86_64.rpm 211 kB/s | 23 kB 00:00 (103/171): libXinerama-1.1.4-1.el8.x86_64.rpm 75 kB/s | 16 kB 00:00 (104/171): mcpp-2.7.2-20.el8.x86_64.rpm 420 kB/s | 31 kB 00:00 (105/171): libSM-1.2.3-1.el8.x86_64.rpm 438 kB/s | 48 kB 00:00 (106/171): xorg-x11-server-utils-7.7-27.el8.x86 1.2 MB/s | 197 kB 00:00 (107/171): libXaw-1.0.13-10.el8.x86_64.rpm 2.6 MB/s | 194 kB 00:00 (108/171): libmcpp-2.7.2-20.el8.x86_64.rpm 220 kB/s | 81 kB 00:00 (109/171): libXfixes-5.0.3-7.el8.x86_64.rpm 209 kB/s | 25 kB 00:00 (110/171): libXxf86vm-1.1.4-9.el8.x86_64.rpm 143 kB/s | 19 kB 00:00 (111/171): libXdamage-1.1.4-14.el8.x86_64.rpm 131 kB/s | 27 kB 00:00 (112/171): libidn-1.34-5.el8.x86_64.rpm 3.4 MB/s | 239 kB 00:00 (113/171): libijs-0.35-5.el8.x86_64.rpm 512 kB/s | 30 kB 00:00 (114/171): atk-2.28.1-1.el8.x86_64.rpm 4.8 MB/s | 272 kB 00:00 (115/171): libthai-0.1.27-2.el8.x86_64.rpm 2.8 MB/s | 203 kB 00:00 (116/171): harfbuzz-1.7.5-3.el8.x86_64.rpm 5.2 MB/s | 294 kB 00:00 (117/171): libXrender-0.9.10-7.el8.x86_64.rpm 588 kB/s | 33 kB 00:00 (118/171): libdatrie-0.2.9-7.el8.x86_64.rpm 584 kB/s | 33 kB 00:00 (119/171): libXcomposite-0.4.4-14.el8.x86_64.rp 280 kB/s | 29 kB 00:00 (120/171): libfontenc-1.1.3-8.el8.x86_64.rpm 604 kB/s | 37 kB 00:00 (121/171): libpaper-1.1.24-22.el8.x86_64.rpm 821 kB/s | 45 kB 00:00 (122/171): libXt-1.1.5-12.el8.x86_64.rpm 2.1 MB/s | 185 kB 00:00 
(123/171): libICE-1.0.9-15.el8.x86_64.rpm 1.1 MB/s | 74 kB 00:00 (124/171): gdk-pixbuf2-modules-2.36.12-5.el8.x8 1.0 MB/s | 109 kB 00:00 (125/171): libxcb-1.13.1-1.el8.x86_64.rpm 2.5 MB/s | 229 kB 00:00 (126/171): perl-Mozilla-CA-20160104-7.module+el 201 kB/s | 15 kB 00:00 (127/171): perl-IO-Socket-SSL-2.066-4.module+el 2.6 MB/s | 298 kB 00:00 (128/171): libXi-1.7.10-1.el8.x86_64.rpm 620 kB/s | 49 kB 00:00 (129/171): libXext-1.3.4-1.el8.x86_64.rpm 321 kB/s | 45 kB 00:00 (130/171): gd-2.2.5-7.el8.x86_64.rpm 1.3 MB/s | 144 kB 00:00 (131/171): libXau-1.0.9-3.el8.x86_64.rpm 696 kB/s | 37 kB 00:00 (132/171): libXmu-1.1.3-1.el8.x86_64.rpm 1.2 MB/s | 75 kB 00:00 (133/171): libXft-2.3.3-1.el8.x86_64.rpm 616 kB/s | 67 kB 00:00 (134/171): libXrandr-1.5.2-1.el8.x86_64.rpm 364 kB/s | 34 kB 00:00 (135/171): gtk2-2.24.32-5.el8.x86_64.rpm 35 MB/s | 3.4 MB 00:00 (136/171): jbig2dec-libs-0.16-1.el8.x86_64.rpm 1.1 MB/s | 72 kB 00:00 (137/171): libjpeg-turbo-1.5.3-12.el8.x86_64.rp 1.8 MB/s | 157 kB 00:00 (138/171): pango-1.42.4-8.el8.x86_64.rpm 2.7 MB/s | 297 kB 00:00 (139/171): libuv-1.41.1-1.el8_4.x86_64.rpm 891 kB/s | 156 kB 00:00 (140/171): jasper-libs-2.0.14-5.el8.x86_64.rpm 2.9 MB/s | 167 kB 00:00 (141/171): xorg-x11-font-utils-7.5-41.el8.x86_6 913 kB/s | 104 kB 00:00 (142/171): cairo-1.15.12-6.el8.x86_64.rpm 11 MB/s | 719 kB 00:00 (143/171): perl-Net-SSLeay-1.88-2.module+el8.6. 
2.7 MB/s | 379 kB 00:00 (144/171): openjpeg2-2.4.0-5.el8.x86_64.rpm 2.1 MB/s | 165 kB 00:00 (145/171): fribidi-1.0.4-9.el8.x86_64.rpm 804 kB/s | 89 kB 00:00 (146/171): gtk-update-icon-cache-3.22.30-11.el8 435 kB/s | 33 kB 00:00 (147/171): vim-filesystem-8.0.1763-19.el8_6.4.n 226 kB/s | 50 kB 00:00 (148/171): python3-rpm-generators-5-8.el8.noarc 387 kB/s | 25 kB 00:00 (149/171): libXpm-3.5.12-9.el8_7.x86_64.rpm 379 kB/s | 58 kB 00:00 (150/171): git-2.39.3-1.el8_8.x86_64.rpm 988 kB/s | 104 kB 00:00 (151/171): git-core-doc-2.39.3-1.el8_8.noarch.r 29 MB/s | 3.0 MB 00:00 (152/171): graphviz-2.40.1-44.el8.x86_64.rpm 25 MB/s | 1.8 MB 00:00 (153/171): git-core-2.39.3-1.el8_8.x86_64.rpm 66 MB/s | 11 MB 00:00 (154/171): libtiff-4.0.9-29.el8_8.x86_64.rpm 2.1 MB/s | 189 kB 00:00 (155/171): perl-Git-2.39.3-1.el8_8.noarch.rpm 759 kB/s | 79 kB 00:00 (156/171): libwebp-1.0.0-9.el8_9.1.x86_64.rpm 3.4 MB/s | 274 kB 00:00 (157/171): libgs-9.27-11.el8.x86_64.rpm 29 MB/s | 3.1 MB 00:00 (158/171): libX11-common-1.6.8-6.el8.noarch.rpm 757 kB/s | 158 kB 00:00 (159/171): libX11-1.6.8-6.el8.x86_64.rpm 9.1 MB/s | 611 kB 00:00 (160/171): cmake-data-3.26.5-1.el8_9.noarch.rpm 25 MB/s | 1.9 MB 00:00 (161/171): librsvg2-2.42.7-5.el8.x86_64.rpm 4.4 MB/s | 575 kB 00:00 (162/171): cmake-rpm-macros-3.26.5-1.el8_9.noar 652 kB/s | 44 kB 00:00 (163/171): cmake-filesystem-3.26.5-1.el8_9.x86_ 511 kB/s | 45 kB 00:00 (164/171): cmake-3.26.5-1.el8_9.x86_64.rpm 62 MB/s | 14 MB 00:00 (165/171): pixman-0.38.4-3.el8_9.x86_64.rpm 3.3 MB/s | 258 kB 00:00 (166/171): python36-devel-3.6.8-38.module+el8.9 293 kB/s | 17 kB 00:00 (167/171): platform-python-devel-3.6.8-56.el8_9 2.0 MB/s | 241 kB 00:00 (168/171): python36-3.6.8-38.module+el8.9.0+209 165 kB/s | 19 kB 00:00 (169/171): python36-rpm-macros-3.6.8-38.module+ 251 kB/s | 16 kB 00:00 (170/171): python3-pip-9.0.3-23.el8_9.1.noarch. 
172 kB/s | 20 kB 00:00 (171/171): doxygen-1.8.14-12.el8.x86_64.rpm 31 MB/s | 3.9 MB 00:00 -------------------------------------------------------------------------------- Total 178 MB/s | 1.5 GB 00:08 Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. Running transaction Preparing : 1/1 Installing : libpng-2:1.6.34-5.el8.x86_64 1/171 Installing : freetype-2.9.1-9.el8.x86_64 2/171 Installing : libjpeg-turbo-1.5.3-12.el8.x86_64 3/171 Installing : libICE-1.0.9-15.el8.x86_64 4/171 Installing : emacs-filesystem-1:26.1-11.el8.noarch 5/171 Installing : fontpackages-filesystem-1.44-22.el8.noarch 6/171 Installing : urw-base35-fonts-common-20170801-10.el8.noarch 7/171 Installing : cuda-toolkit-config-common-12.4.127-1.noarch 8/171 Installing : cuda-toolkit-12-config-common-12.4.127-1.noarch 9/171 Installing : cuda-toolkit-12-4-config-common-12.4.127-1.noarc 10/171 Installing : google-droid-sans-fonts-20120715-13.el8.noarch 11/171 Installing : fontconfig-2.13.1-4.el8.x86_64 12/171 Running scriptlet: fontconfig-2.13.1-4.el8.x86_64 12/171 Installing : libSM-1.2.3-1.el8.x86_64 13/171 Installing : cmake-rpm-macros-3.26.5-1.el8_9.noarch 14/171 Installing : cmake-filesystem-3.26.5-1.el8_9.x86_64 15/171 Installing : atk-2.28.1-1.el8.x86_64 16/171 Installing : adobe-mappings-cmap-20171205-3.el8.noarch 17/171 Installing : adobe-mappings-cmap-deprecated-20171205-3.el8.no 18/171 Installing : cuda-cudart-12-4-12.4.127-1.x86_64 19/171 Running scriptlet: cuda-cudart-12-4-12.4.127-1.x86_64 19/171 Installing : libcublas-12-4-12.4.5.8-1.x86_64 20/171 Running scriptlet: libcublas-12-4-12.4.5.8-1.x86_64 20/171 Installing : libcurand-12-4-10.3.5.147-1.x86_64 21/171 Running scriptlet: libcurand-12-4-10.3.5.147-1.x86_64 21/171 Installing : libidn-1.34-5.el8.x86_64 22/171 Running scriptlet: libidn-1.34-5.el8.x86_64 22/171 Installing : jasper-libs-2.0.14-5.el8.x86_64 23/171 Installing : pixman-0.38.4-3.el8_9.x86_64 24/171 Installing : 
libwebp-1.0.0-9.el8_9.1.x86_64 25/171 Installing : libX11-common-1.6.8-6.el8.noarch 26/171 Installing : python3-rpm-generators-5-8.el8.noarch 27/171 Installing : platform-python-devel-3.6.8-56.el8_9.3.x86_64 28/171 Installing : openjpeg2-2.4.0-5.el8.x86_64 29/171 Installing : fribidi-1.0.4-9.el8.x86_64 30/171 Installing : vim-filesystem-2:8.0.1763-19.el8_6.4.noarch 31/171 Installing : libuv-1:1.41.1-1.el8_4.x86_64 32/171 Installing : cmake-3.26.5-1.el8_9.x86_64 33/171 Installing : cmake-data-3.26.5-1.el8_9.noarch 34/171 Installing : jbig2dec-libs-0.16-1.el8.x86_64 35/171 Running scriptlet: jbig2dec-libs-0.16-1.el8.x86_64 35/171 Installing : libXau-1.0.9-3.el8.x86_64 36/171 Installing : libxcb-1.13.1-1.el8.x86_64 37/171 Installing : libX11-1.6.8-6.el8.x86_64 38/171 Installing : libXext-1.3.4-1.el8.x86_64 39/171 Installing : libXrender-0.9.10-7.el8.x86_64 40/171 Installing : cairo-1.15.12-6.el8.x86_64 41/171 Installing : libXt-1.1.5-12.el8.x86_64 42/171 Installing : libXmu-1.1.3-1.el8.x86_64 43/171 Installing : libXfixes-5.0.3-7.el8.x86_64 44/171 Installing : libXpm-3.5.12-9.el8_7.x86_64 45/171 Installing : libXcursor-1.1.15-3.el8.x86_64 46/171 Installing : libXrandr-1.5.2-1.el8.x86_64 47/171 Installing : libXinerama-1.1.4-1.el8.x86_64 48/171 Installing : libXi-1.7.10-1.el8.x86_64 49/171 Installing : libXaw-1.0.13-10.el8.x86_64 50/171 Installing : libXdamage-1.1.4-14.el8.x86_64 51/171 Installing : libXft-2.3.3-1.el8.x86_64 52/171 Installing : libXxf86misc-1.0.4-1.el8.x86_64 53/171 Installing : libXxf86vm-1.1.4-9.el8.x86_64 54/171 Installing : libXcomposite-0.4.4-14.el8.x86_64 55/171 Installing : libpaper-1.1.24-22.el8.x86_64 56/171 Installing : libfontenc-1.1.3-8.el8.x86_64 57/171 Installing : xorg-x11-font-utils-1:7.5-41.el8.x86_64 58/171 Installing : xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarc 59/171 Running scriptlet: xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarc 59/171 Installing : libdatrie-0.2.9-7.el8.x86_64 60/171 Running scriptlet: 
libdatrie-0.2.9-7.el8.x86_64 60/171 Installing : libthai-0.1.27-2.el8.x86_64 61/171 Running scriptlet: libthai-0.1.27-2.el8.x86_64 61/171 Installing : libijs-0.35-5.el8.x86_64 62/171 Installing : libmcpp-2.7.2-20.el8.x86_64 63/171 Running scriptlet: libmcpp-2.7.2-20.el8.x86_64 63/171 Installing : mcpp-2.7.2-20.el8.x86_64 64/171 Installing : xorg-x11-server-utils-7.7-27.el8.x86_64 65/171 Installing : urw-base35-gothic-fonts-20170801-10.el8.noarch 66/171 Running scriptlet: urw-base35-gothic-fonts-20170801-10.el8.noarch 66/171 Installing : urw-base35-p052-fonts-20170801-10.el8.noarch 67/171 Running scriptlet: urw-base35-p052-fonts-20170801-10.el8.noarch 67/171 Installing : urw-base35-bookman-fonts-20170801-10.el8.noarch 68/171 Running scriptlet: urw-base35-bookman-fonts-20170801-10.el8.noarch 68/171 Installing : urw-base35-c059-fonts-20170801-10.el8.noarch 69/171 Running scriptlet: urw-base35-c059-fonts-20170801-10.el8.noarch 69/171 Installing : urw-base35-d050000l-fonts-20170801-10.el8.noarch 70/171 Running scriptlet: urw-base35-d050000l-fonts-20170801-10.el8.noarch 70/171 Installing : urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 71/171 Running scriptlet: urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 
71/171 Installing : urw-base35-nimbus-roman-fonts-20170801-10.el8.no 72/171 Running scriptlet: urw-base35-nimbus-roman-fonts-20170801-10.el8.no 72/171 Installing : urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 73/171 Running scriptlet: urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 73/171 Installing : urw-base35-standard-symbols-ps-fonts-20170801-10 74/171 Running scriptlet: urw-base35-standard-symbols-ps-fonts-20170801-10 74/171 Installing : urw-base35-z003-fonts-20170801-10.el8.noarch 75/171 Running scriptlet: urw-base35-z003-fonts-20170801-10.el8.noarch 75/171 Installing : urw-base35-fonts-20170801-10.el8.noarch 76/171 Installing : jbigkit-libs-2.1-14.el8.x86_64 77/171 Running scriptlet: jbigkit-libs-2.1-14.el8.x86_64 77/171 Installing : libtiff-4.0.9-29.el8_8.x86_64 78/171 Installing : gd-2.2.5-7.el8.x86_64 79/171 Running scriptlet: gd-2.2.5-7.el8.x86_64 79/171 Installing : graphite2-1.3.10-10.el8.x86_64 80/171 Installing : harfbuzz-1.7.5-3.el8.x86_64 81/171 Running scriptlet: harfbuzz-1.7.5-3.el8.x86_64 81/171 Installing : pango-1.42.4-8.el8.x86_64 82/171 Running scriptlet: pango-1.42.4-8.el8.x86_64 82/171 Installing : lcms2-2.9-2.el8.x86_64 83/171 Running scriptlet: lcms2-2.9-2.el8.x86_64 83/171 Installing : hicolor-icon-theme-0.17-2.el8.noarch 84/171 Installing : adobe-mappings-pdf-20180407-1.el8.noarch 85/171 Installing : platform-python-pip-9.0.3-23.el8_9.1.noarch 86/171 Installing : less-530-2.el8_9.x86_64 87/171 Running scriptlet: openssh-8.0p1-19.el8_9.2.x86_64 88/171 Installing : openssh-8.0p1-19.el8_9.2.x86_64 88/171 Installing : openssl-1:1.1.1k-12.el8_9.x86_64 89/171 Installing : dbus-libs-1:1.12.8-26.el8.x86_64 90/171 Running scriptlet: dbus-libs-1:1.12.8-26.el8.x86_64 90/171 Installing : avahi-libs-0.7-21.el8_9.1.x86_64 91/171 Installing : cups-libs-1:2.2.6-54.el8_9.x86_64 92/171 Installing : libgs-9.27-11.el8.x86_64 93/171 Installing : python3-setuptools-39.2.0-7.el8.noarch 94/171 Installing : python3-pip-9.0.3-23.el8_9.1.noarch 95/171 
Installing : python36-3.6.8-38.module+el8.9.0+20976+d3c38525. 96/171 Running scriptlet: python36-3.6.8-38.module+el8.9.0+20976+d3c38525. 96/171 Installing : libcroco-0.6.12-4.el8_2.1.x86_64 97/171 Running scriptlet: libcroco-0.6.12-4.el8_2.1.x86_64 97/171 Installing : shared-mime-info-1.9-3.el8.x86_64 98/171 Running scriptlet: shared-mime-info-1.9-3.el8.x86_64 98/171 Installing : gdk-pixbuf2-2.36.12-5.el8.x86_64 99/171 Running scriptlet: gdk-pixbuf2-2.36.12-5.el8.x86_64 99/171 Installing : gdk-pixbuf2-modules-2.36.12-5.el8.x86_64 100/171 Installing : gtk-update-icon-cache-3.22.30-11.el8.x86_64 101/171 Installing : gtk2-2.24.32-5.el8.x86_64 102/171 Running scriptlet: gtk2-2.24.32-5.el8.x86_64 102/171 Installing : librsvg2-2.42.7-5.el8.x86_64 103/171 Installing : libedit-3.1-23.20170329cvs.el8.x86_64 104/171 Installing : openssh-clients-8.0p1-19.el8_9.2.x86_64 105/171 Installing : git-core-2.39.3-1.el8_8.x86_64 106/171 Installing : git-core-doc-2.39.3-1.el8_8.noarch 107/171 Installing : groff-base-1.22.3-18.el8.x86_64 108/171 Installing : perl-Digest-1.17-395.el8.noarch 109/171 Installing : perl-Digest-MD5-2.55-396.el8.x86_64 110/171 Installing : perl-Data-Dumper-2.167-399.el8.x86_64 111/171 Installing : perl-libnet-3.11-3.el8.noarch 112/171 Installing : perl-URI-1.73-3.el8.noarch 113/171 Installing : perl-Pod-Escapes-1:1.07-395.el8.noarch 114/171 Installing : perl-Time-Local-1:1.280-1.el8.noarch 115/171 Installing : perl-IO-Socket-IP-0.39-5.el8.noarch 116/171 Installing : perl-Mozilla-CA-20160104-7.module+el8.3.0+6498+9 117/171 Installing : perl-Net-SSLeay-1.88-2.module+el8.6.0+13392+f089 118/171 Installing : perl-IO-Socket-SSL-2.066-4.module+el8.3.0+6446+5 119/171 Installing : perl-Term-ANSIColor-4.06-396.el8.noarch 120/171 Installing : perl-Term-Cap-1.17-395.el8.noarch 121/171 Installing : perl-File-Temp-0.230.600-1.el8.noarch 122/171 Installing : perl-HTTP-Tiny-0.074-2.el8_9.1.noarch 123/171 Installing : perl-Pod-Simple-1:3.35-395.el8.noarch 124/171 Installing : 
perl-podlators-4.11-1.el8.noarch 125/171 Installing : perl-Pod-Perldoc-3.28-396.el8.noarch 126/171 Installing : perl-Text-ParseWords-3.30-395.el8.noarch 127/171 Installing : perl-Pod-Usage-4:1.69-395.el8.noarch 128/171 Installing : perl-MIME-Base64-3.15-396.el8.x86_64 129/171 Installing : perl-Storable-1:3.11-3.el8.x86_64 130/171 Installing : perl-Getopt-Long-1:2.50-4.el8.noarch 131/171 Installing : perl-Socket-4:2.027-3.el8.x86_64 132/171 Installing : perl-Errno-1.28-422.el8.x86_64 133/171 Installing : perl-Encode-4:2.97-3.el8.x86_64 134/171 Installing : perl-Scalar-List-Utils-3:1.49-2.el8.x86_64 135/171 Installing : perl-Carp-1.42-396.el8.noarch 136/171 Installing : perl-Exporter-5.72-396.el8.noarch 137/171 Installing : perl-libs-4:5.26.3-422.el8.x86_64 138/171 Installing : perl-parent-1:0.237-1.el8.noarch 139/171 Installing : perl-macros-4:5.26.3-422.el8.x86_64 140/171 Installing : perl-Unicode-Normalize-1.25-396.el8.x86_64 141/171 Installing : perl-Text-Tabs+Wrap-2013.0523-395.el8.noarch 142/171 Installing : perl-constant-1.33-396.el8.noarch 143/171 Installing : perl-PathTools-3.74-1.el8.x86_64 144/171 Installing : perl-threads-shared-1.58-2.el8.x86_64 145/171 Installing : perl-threads-1:2.21-2.el8.x86_64 146/171 Installing : perl-File-Path-2.15-2.el8.noarch 147/171 Installing : perl-IO-1.38-422.el8.x86_64 148/171 Installing : perl-interpreter-4:5.26.3-422.el8.x86_64 149/171 Installing : perl-Error-1:0.17025-2.el8.noarch 150/171 Installing : perl-TermReadKey-2.37-7.el8.x86_64 151/171 Installing : perl-Git-2.39.3-1.el8_8.noarch 152/171 Installing : git-2.39.3-1.el8_8.x86_64 153/171 Installing : cuda-nvvm-12-4-12.4.131-1.x86_64 154/171 Installing : cuda-nvrtc-12-4-12.4.127-1.x86_64 155/171 Running scriptlet: cuda-nvrtc-12-4-12.4.127-1.x86_64 155/171 Installing : cuda-crt-12-4-12.4.131-1.x86_64 156/171 Installing : cuda-cccl-12-4-12.4.127-1.x86_64 157/171 Installing : libcudnn8-8.9.7.29-2.cuda12.3.x86_64 158/171 Installing : 
libcudnn8-devel-8.9.7.29-2.cuda12.3.x86_64 159/171 Running scriptlet: libcudnn8-devel-8.9.7.29-2.cuda12.3.x86_64 159/171 Installing : cuda-cudart-devel-12-4-12.4.127-1.x86_64 160/171 Installing : cuda-nvcc-12-4-12.4.131-1.x86_64 161/171 Installing : cuda-nvrtc-devel-12-4-12.4.127-1.x86_64 162/171 Installing : doxygen-1:1.8.14-12.el8.x86_64 163/171 Installing : graphviz-2.40.1-44.el8.x86_64 164/171 Running scriptlet: graphviz-2.40.1-44.el8.x86_64 164/171 Installing : python36-devel-3.6.8-38.module+el8.9.0+20976+d3c 165/171 Running scriptlet: python36-devel-3.6.8-38.module+el8.9.0+20976+d3c 165/171 Installing : libcurand-devel-12-4-10.3.5.147-1.x86_64 166/171 Installing : libcublas-devel-12-4-12.4.5.8-1.x86_64 167/171 Installing : python36-rpm-macros-3.6.8-38.module+el8.9.0+2097 168/171 Installing : cuda-nvtx-12-4-12.4.127-1.x86_64 169/171 Installing : cuda-nvml-devel-12-4-12.4.127-1.x86_64 170/171 Installing : cuda-driver-devel-12-4-12.4.127-1.x86_64 171/171 Running scriptlet: cuda-toolkit-12-4-config-common-12.4.127-1.noarc 171/171 Running scriptlet: urw-base35-gothic-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-p052-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-bookman-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-c059-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-d050000l-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 
171/171 Running scriptlet: urw-base35-nimbus-roman-fonts-20170801-10.el8.no 171/171 Running scriptlet: urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 171/171 Running scriptlet: urw-base35-standard-symbols-ps-fonts-20170801-10 171/171 Running scriptlet: urw-base35-z003-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: cuda-driver-devel-12-4-12.4.127-1.x86_64 171/171 Running scriptlet: fontconfig-2.13.1-4.el8.x86_64 171/171 Running scriptlet: hicolor-icon-theme-0.17-2.el8.noarch 171/171 Running scriptlet: shared-mime-info-1.9-3.el8.x86_64 171/171 Running scriptlet: gdk-pixbuf2-2.36.12-5.el8.x86_64 171/171 Verifying : libcudnn8-8.9.7.29-2.cuda12.3.x86_64 1/171 Verifying : libcudnn8-devel-8.9.7.29-2.cuda12.3.x86_64 2/171 Verifying : cuda-cccl-12-4-12.4.127-1.x86_64 3/171 Verifying : cuda-crt-12-4-12.4.131-1.x86_64 4/171 Verifying : cuda-cudart-12-4-12.4.127-1.x86_64 5/171 Verifying : cuda-cudart-devel-12-4-12.4.127-1.x86_64 6/171 Verifying : cuda-driver-devel-12-4-12.4.127-1.x86_64 7/171 Verifying : cuda-nvcc-12-4-12.4.131-1.x86_64 8/171 Verifying : cuda-nvml-devel-12-4-12.4.127-1.x86_64 9/171 Verifying : cuda-nvrtc-12-4-12.4.127-1.x86_64 10/171 Verifying : cuda-nvrtc-devel-12-4-12.4.127-1.x86_64 11/171 Verifying : cuda-nvtx-12-4-12.4.127-1.x86_64 12/171 Verifying : cuda-nvvm-12-4-12.4.131-1.x86_64 13/171 Verifying : cuda-toolkit-12-4-config-common-12.4.127-1.noarc 14/171 Verifying : cuda-toolkit-12-config-common-12.4.127-1.noarch 15/171 Verifying : cuda-toolkit-config-common-12.4.127-1.noarch 16/171 Verifying : libcublas-12-4-12.4.5.8-1.x86_64 17/171 Verifying : libcublas-devel-12-4-12.4.5.8-1.x86_64 18/171 Verifying : libcurand-12-4-10.3.5.147-1.x86_64 19/171 Verifying : libcurand-devel-12-4-10.3.5.147-1.x86_64 20/171 Verifying : groff-base-1.22.3-18.el8.x86_64 21/171 Verifying : libedit-3.1-23.20170329cvs.el8.x86_64 22/171 Verifying : libpng-2:1.6.34-5.el8.x86_64 23/171 Verifying : perl-Data-Dumper-2.167-399.el8.x86_64 24/171 Verifying : 
perl-Encode-4:2.97-3.el8.x86_64 25/171 Verifying : perl-MIME-Base64-3.15-396.el8.x86_64 26/171 Verifying : perl-PathTools-3.74-1.el8.x86_64 27/171 Verifying : perl-Scalar-List-Utils-3:1.49-2.el8.x86_64 28/171 Verifying : perl-Unicode-Normalize-1.25-396.el8.x86_64 29/171 Verifying : perl-threads-shared-1.58-2.el8.x86_64 30/171 Verifying : shared-mime-info-1.9-3.el8.x86_64 31/171 Verifying : fontpackages-filesystem-1.44-22.el8.noarch 32/171 Verifying : perl-Carp-1.42-396.el8.noarch 33/171 Verifying : perl-Exporter-5.72-396.el8.noarch 34/171 Verifying : perl-File-Path-2.15-2.el8.noarch 35/171 Verifying : perl-File-Temp-0.230.600-1.el8.noarch 36/171 Verifying : perl-Getopt-Long-1:2.50-4.el8.noarch 37/171 Verifying : perl-Pod-Escapes-1:1.07-395.el8.noarch 38/171 Verifying : perl-Pod-Perldoc-3.28-396.el8.noarch 39/171 Verifying : perl-Pod-Simple-1:3.35-395.el8.noarch 40/171 Verifying : perl-Pod-Usage-4:1.69-395.el8.noarch 41/171 Verifying : perl-Storable-1:3.11-3.el8.x86_64 42/171 Verifying : perl-Term-ANSIColor-4.06-396.el8.noarch 43/171 Verifying : perl-Term-Cap-1.17-395.el8.noarch 44/171 Verifying : perl-Text-ParseWords-3.30-395.el8.noarch 45/171 Verifying : perl-Text-Tabs+Wrap-2013.0523-395.el8.noarch 46/171 Verifying : perl-Time-Local-1:1.280-1.el8.noarch 47/171 Verifying : perl-constant-1.33-396.el8.noarch 48/171 Verifying : perl-parent-1:0.237-1.el8.noarch 49/171 Verifying : perl-podlators-4.11-1.el8.noarch 50/171 Verifying : perl-threads-1:2.21-2.el8.x86_64 51/171 Verifying : gdk-pixbuf2-2.36.12-5.el8.x86_64 52/171 Verifying : perl-Socket-4:2.027-3.el8.x86_64 53/171 Verifying : libcroco-0.6.12-4.el8_2.1.x86_64 54/171 Verifying : fontconfig-2.13.1-4.el8.x86_64 55/171 Verifying : freetype-2.9.1-9.el8.x86_64 56/171 Verifying : perl-Errno-1.28-422.el8.x86_64 57/171 Verifying : perl-IO-1.38-422.el8.x86_64 58/171 Verifying : perl-interpreter-4:5.26.3-422.el8.x86_64 59/171 Verifying : perl-libs-4:5.26.3-422.el8.x86_64 60/171 Verifying : 
perl-macros-4:5.26.3-422.el8.x86_64 61/171 Verifying : python3-setuptools-39.2.0-7.el8.noarch 62/171 Verifying : dbus-libs-1:1.12.8-26.el8.x86_64 63/171 Verifying : emacs-filesystem-1:26.1-11.el8.noarch 64/171 Verifying : perl-Digest-MD5-2.55-396.el8.x86_64 65/171 Verifying : perl-URI-1.73-3.el8.noarch 66/171 Verifying : perl-libnet-3.11-3.el8.noarch 67/171 Verifying : avahi-libs-0.7-21.el8_9.1.x86_64 68/171 Verifying : cups-libs-1:2.2.6-54.el8_9.x86_64 69/171 Verifying : openssl-1:1.1.1k-12.el8_9.x86_64 70/171 Verifying : perl-Digest-1.17-395.el8.noarch 71/171 Verifying : perl-IO-Socket-IP-0.39-5.el8.noarch 72/171 Verifying : openssh-8.0p1-19.el8_9.2.x86_64 73/171 Verifying : openssh-clients-8.0p1-19.el8_9.2.x86_64 74/171 Verifying : less-530-2.el8_9.x86_64 75/171 Verifying : perl-HTTP-Tiny-0.074-2.el8_9.1.noarch 76/171 Verifying : platform-python-pip-9.0.3-23.el8_9.1.noarch 77/171 Verifying : google-droid-sans-fonts-20120715-13.el8.noarch 78/171 Verifying : urw-base35-fonts-20170801-10.el8.noarch 79/171 Verifying : urw-base35-gothic-fonts-20170801-10.el8.noarch 80/171 Verifying : urw-base35-p052-fonts-20170801-10.el8.noarch 81/171 Verifying : xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarc 82/171 Verifying : adobe-mappings-cmap-20171205-3.el8.noarch 83/171 Verifying : adobe-mappings-cmap-deprecated-20171205-3.el8.no 84/171 Verifying : adobe-mappings-pdf-20180407-1.el8.noarch 85/171 Verifying : hicolor-icon-theme-0.17-2.el8.noarch 86/171 Verifying : lcms2-2.9-2.el8.x86_64 87/171 Verifying : perl-Error-1:0.17025-2.el8.noarch 88/171 Verifying : perl-TermReadKey-2.37-7.el8.x86_64 89/171 Verifying : urw-base35-bookman-fonts-20170801-10.el8.noarch 90/171 Verifying : urw-base35-c059-fonts-20170801-10.el8.noarch 91/171 Verifying : urw-base35-d050000l-fonts-20170801-10.el8.noarch 92/171 Verifying : urw-base35-fonts-common-20170801-10.el8.noarch 93/171 Verifying : urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 
94/171 Verifying : urw-base35-nimbus-roman-fonts-20170801-10.el8.no 95/171 Verifying : urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 96/171 Verifying : urw-base35-standard-symbols-ps-fonts-20170801-10 97/171 Verifying : urw-base35-z003-fonts-20170801-10.el8.noarch 98/171 Verifying : graphite2-1.3.10-10.el8.x86_64 99/171 Verifying : jbigkit-libs-2.1-14.el8.x86_64 100/171 Verifying : libXcursor-1.1.15-3.el8.x86_64 101/171 Verifying : libXinerama-1.1.4-1.el8.x86_64 102/171 Verifying : libXxf86misc-1.0.4-1.el8.x86_64 103/171 Verifying : libmcpp-2.7.2-20.el8.x86_64 104/171 Verifying : mcpp-2.7.2-20.el8.x86_64 105/171 Verifying : xorg-x11-server-utils-7.7-27.el8.x86_64 106/171 Verifying : libSM-1.2.3-1.el8.x86_64 107/171 Verifying : libXaw-1.0.13-10.el8.x86_64 108/171 Verifying : libXdamage-1.1.4-14.el8.x86_64 109/171 Verifying : libXfixes-5.0.3-7.el8.x86_64 110/171 Verifying : libXxf86vm-1.1.4-9.el8.x86_64 111/171 Verifying : libidn-1.34-5.el8.x86_64 112/171 Verifying : libijs-0.35-5.el8.x86_64 113/171 Verifying : libthai-0.1.27-2.el8.x86_64 114/171 Verifying : atk-2.28.1-1.el8.x86_64 115/171 Verifying : harfbuzz-1.7.5-3.el8.x86_64 116/171 Verifying : libXcomposite-0.4.4-14.el8.x86_64 117/171 Verifying : libXrender-0.9.10-7.el8.x86_64 118/171 Verifying : libdatrie-0.2.9-7.el8.x86_64 119/171 Verifying : libfontenc-1.1.3-8.el8.x86_64 120/171 Verifying : libpaper-1.1.24-22.el8.x86_64 121/171 Verifying : libXt-1.1.5-12.el8.x86_64 122/171 Verifying : gdk-pixbuf2-modules-2.36.12-5.el8.x86_64 123/171 Verifying : libICE-1.0.9-15.el8.x86_64 124/171 Verifying : libxcb-1.13.1-1.el8.x86_64 125/171 Verifying : perl-IO-Socket-SSL-2.066-4.module+el8.3.0+6446+5 126/171 Verifying : perl-Mozilla-CA-20160104-7.module+el8.3.0+6498+9 127/171 Verifying : libXext-1.3.4-1.el8.x86_64 128/171 Verifying : libXi-1.7.10-1.el8.x86_64 129/171 Verifying : gd-2.2.5-7.el8.x86_64 130/171 Verifying : libXau-1.0.9-3.el8.x86_64 131/171 Verifying : libXft-2.3.3-1.el8.x86_64 132/171 Verifying : 
libXmu-1.1.3-1.el8.x86_64 133/171 Verifying : libXrandr-1.5.2-1.el8.x86_64 134/171 Verifying : gtk2-2.24.32-5.el8.x86_64 135/171 Verifying : jbig2dec-libs-0.16-1.el8.x86_64 136/171 Verifying : libuv-1:1.41.1-1.el8_4.x86_64 137/171 Verifying : libjpeg-turbo-1.5.3-12.el8.x86_64 138/171 Verifying : pango-1.42.4-8.el8.x86_64 139/171 Verifying : xorg-x11-font-utils-1:7.5-41.el8.x86_64 140/171 Verifying : jasper-libs-2.0.14-5.el8.x86_64 141/171 Verifying : perl-Net-SSLeay-1.88-2.module+el8.6.0+13392+f089 142/171 Verifying : cairo-1.15.12-6.el8.x86_64 143/171 Verifying : vim-filesystem-2:8.0.1763-19.el8_6.4.noarch 144/171 Verifying : fribidi-1.0.4-9.el8.x86_64 145/171 Verifying : openjpeg2-2.4.0-5.el8.x86_64 146/171 Verifying : gtk-update-icon-cache-3.22.30-11.el8.x86_64 147/171 Verifying : libXpm-3.5.12-9.el8_7.x86_64 148/171 Verifying : python3-rpm-generators-5-8.el8.noarch 149/171 Verifying : git-2.39.3-1.el8_8.x86_64 150/171 Verifying : git-core-2.39.3-1.el8_8.x86_64 151/171 Verifying : git-core-doc-2.39.3-1.el8_8.noarch 152/171 Verifying : graphviz-2.40.1-44.el8.x86_64 153/171 Verifying : perl-Git-2.39.3-1.el8_8.noarch 154/171 Verifying : libtiff-4.0.9-29.el8_8.x86_64 155/171 Verifying : libX11-common-1.6.8-6.el8.noarch 156/171 Verifying : libgs-9.27-11.el8.x86_64 157/171 Verifying : libwebp-1.0.0-9.el8_9.1.x86_64 158/171 Verifying : libX11-1.6.8-6.el8.x86_64 159/171 Verifying : librsvg2-2.42.7-5.el8.x86_64 160/171 Verifying : cmake-3.26.5-1.el8_9.x86_64 161/171 Verifying : cmake-data-3.26.5-1.el8_9.noarch 162/171 Verifying : cmake-filesystem-3.26.5-1.el8_9.x86_64 163/171 Verifying : cmake-rpm-macros-3.26.5-1.el8_9.noarch 164/171 Verifying : pixman-0.38.4-3.el8_9.x86_64 165/171 Verifying : platform-python-devel-3.6.8-56.el8_9.3.x86_64 166/171 Verifying : python36-3.6.8-38.module+el8.9.0+20976+d3c38525. 
167/171 Verifying : python36-devel-3.6.8-38.module+el8.9.0+20976+d3c 168/171 Verifying : python36-rpm-macros-3.6.8-38.module+el8.9.0+2097 169/171 Verifying : python3-pip-9.0.3-23.el8_9.1.noarch 170/171 Verifying : doxygen-1:1.8.14-12.el8.x86_64 171/171 Installed products updated. Installed: adobe-mappings-cmap-20171205-3.el8.noarch adobe-mappings-cmap-deprecated-20171205-3.el8.noarch adobe-mappings-pdf-20180407-1.el8.noarch atk-2.28.1-1.el8.x86_64 avahi-libs-0.7-21.el8_9.1.x86_64 cairo-1.15.12-6.el8.x86_64 cmake-3.26.5-1.el8_9.x86_64 cmake-data-3.26.5-1.el8_9.noarch cmake-filesystem-3.26.5-1.el8_9.x86_64 cmake-rpm-macros-3.26.5-1.el8_9.noarch cuda-cccl-12-4-12.4.127-1.x86_64 cuda-crt-12-4-12.4.131-1.x86_64 cuda-cudart-12-4-12.4.127-1.x86_64 cuda-cudart-devel-12-4-12.4.127-1.x86_64 cuda-driver-devel-12-4-12.4.127-1.x86_64 cuda-nvcc-12-4-12.4.131-1.x86_64 cuda-nvml-devel-12-4-12.4.127-1.x86_64 cuda-nvrtc-12-4-12.4.127-1.x86_64 cuda-nvrtc-devel-12-4-12.4.127-1.x86_64 cuda-nvtx-12-4-12.4.127-1.x86_64 cuda-nvvm-12-4-12.4.131-1.x86_64 cuda-toolkit-12-4-config-common-12.4.127-1.noarch cuda-toolkit-12-config-common-12.4.127-1.noarch cuda-toolkit-config-common-12.4.127-1.noarch cups-libs-1:2.2.6-54.el8_9.x86_64 dbus-libs-1:1.12.8-26.el8.x86_64 doxygen-1:1.8.14-12.el8.x86_64 emacs-filesystem-1:26.1-11.el8.noarch fontconfig-2.13.1-4.el8.x86_64 fontpackages-filesystem-1.44-22.el8.noarch freetype-2.9.1-9.el8.x86_64 fribidi-1.0.4-9.el8.x86_64 gd-2.2.5-7.el8.x86_64 gdk-pixbuf2-2.36.12-5.el8.x86_64 gdk-pixbuf2-modules-2.36.12-5.el8.x86_64 git-2.39.3-1.el8_8.x86_64 git-core-2.39.3-1.el8_8.x86_64 git-core-doc-2.39.3-1.el8_8.noarch google-droid-sans-fonts-20120715-13.el8.noarch graphite2-1.3.10-10.el8.x86_64 graphviz-2.40.1-44.el8.x86_64 groff-base-1.22.3-18.el8.x86_64 gtk-update-icon-cache-3.22.30-11.el8.x86_64 gtk2-2.24.32-5.el8.x86_64 harfbuzz-1.7.5-3.el8.x86_64 hicolor-icon-theme-0.17-2.el8.noarch jasper-libs-2.0.14-5.el8.x86_64 jbig2dec-libs-0.16-1.el8.x86_64 
jbigkit-libs-2.1-14.el8.x86_64 lcms2-2.9-2.el8.x86_64 less-530-2.el8_9.x86_64 libICE-1.0.9-15.el8.x86_64 libSM-1.2.3-1.el8.x86_64 libX11-1.6.8-6.el8.x86_64 libX11-common-1.6.8-6.el8.noarch libXau-1.0.9-3.el8.x86_64 libXaw-1.0.13-10.el8.x86_64 libXcomposite-0.4.4-14.el8.x86_64 libXcursor-1.1.15-3.el8.x86_64 libXdamage-1.1.4-14.el8.x86_64 libXext-1.3.4-1.el8.x86_64 libXfixes-5.0.3-7.el8.x86_64 libXft-2.3.3-1.el8.x86_64 libXi-1.7.10-1.el8.x86_64 libXinerama-1.1.4-1.el8.x86_64 libXmu-1.1.3-1.el8.x86_64 libXpm-3.5.12-9.el8_7.x86_64 libXrandr-1.5.2-1.el8.x86_64 libXrender-0.9.10-7.el8.x86_64 libXt-1.1.5-12.el8.x86_64 libXxf86misc-1.0.4-1.el8.x86_64 libXxf86vm-1.1.4-9.el8.x86_64 libcroco-0.6.12-4.el8_2.1.x86_64 libcublas-12-4-12.4.5.8-1.x86_64 libcublas-devel-12-4-12.4.5.8-1.x86_64 libcudnn8-8.9.7.29-2.cuda12.3.x86_64 libcudnn8-devel-8.9.7.29-2.cuda12.3.x86_64 libcurand-12-4-10.3.5.147-1.x86_64 libcurand-devel-12-4-10.3.5.147-1.x86_64 libdatrie-0.2.9-7.el8.x86_64 libedit-3.1-23.20170329cvs.el8.x86_64 libfontenc-1.1.3-8.el8.x86_64 libgs-9.27-11.el8.x86_64 libidn-1.34-5.el8.x86_64 libijs-0.35-5.el8.x86_64 libjpeg-turbo-1.5.3-12.el8.x86_64 libmcpp-2.7.2-20.el8.x86_64 libpaper-1.1.24-22.el8.x86_64 libpng-2:1.6.34-5.el8.x86_64 librsvg2-2.42.7-5.el8.x86_64 libthai-0.1.27-2.el8.x86_64 libtiff-4.0.9-29.el8_8.x86_64 libuv-1:1.41.1-1.el8_4.x86_64 libwebp-1.0.0-9.el8_9.1.x86_64 libxcb-1.13.1-1.el8.x86_64 mcpp-2.7.2-20.el8.x86_64 openjpeg2-2.4.0-5.el8.x86_64 openssh-8.0p1-19.el8_9.2.x86_64 openssh-clients-8.0p1-19.el8_9.2.x86_64 openssl-1:1.1.1k-12.el8_9.x86_64 pango-1.42.4-8.el8.x86_64 perl-Carp-1.42-396.el8.noarch perl-Data-Dumper-2.167-399.el8.x86_64 perl-Digest-1.17-395.el8.noarch perl-Digest-MD5-2.55-396.el8.x86_64 perl-Encode-4:2.97-3.el8.x86_64 perl-Errno-1.28-422.el8.x86_64 perl-Error-1:0.17025-2.el8.noarch perl-Exporter-5.72-396.el8.noarch perl-File-Path-2.15-2.el8.noarch perl-File-Temp-0.230.600-1.el8.noarch perl-Getopt-Long-1:2.50-4.el8.noarch 
perl-Git-2.39.3-1.el8_8.noarch perl-HTTP-Tiny-0.074-2.el8_9.1.noarch perl-IO-1.38-422.el8.x86_64 perl-IO-Socket-IP-0.39-5.el8.noarch perl-IO-Socket-SSL-2.066-4.module+el8.3.0+6446+594cad75.noarch perl-MIME-Base64-3.15-396.el8.x86_64 perl-Mozilla-CA-20160104-7.module+el8.3.0+6498+9eecfe51.noarch perl-Net-SSLeay-1.88-2.module+el8.6.0+13392+f0897f98.x86_64 perl-PathTools-3.74-1.el8.x86_64 perl-Pod-Escapes-1:1.07-395.el8.noarch perl-Pod-Perldoc-3.28-396.el8.noarch perl-Pod-Simple-1:3.35-395.el8.noarch perl-Pod-Usage-4:1.69-395.el8.noarch perl-Scalar-List-Utils-3:1.49-2.el8.x86_64 perl-Socket-4:2.027-3.el8.x86_64 perl-Storable-1:3.11-3.el8.x86_64 perl-Term-ANSIColor-4.06-396.el8.noarch perl-Term-Cap-1.17-395.el8.noarch perl-TermReadKey-2.37-7.el8.x86_64 perl-Text-ParseWords-3.30-395.el8.noarch perl-Text-Tabs+Wrap-2013.0523-395.el8.noarch perl-Time-Local-1:1.280-1.el8.noarch perl-URI-1.73-3.el8.noarch perl-Unicode-Normalize-1.25-396.el8.x86_64 perl-constant-1.33-396.el8.noarch perl-interpreter-4:5.26.3-422.el8.x86_64 perl-libnet-3.11-3.el8.noarch perl-libs-4:5.26.3-422.el8.x86_64 perl-macros-4:5.26.3-422.el8.x86_64 perl-parent-1:0.237-1.el8.noarch perl-podlators-4.11-1.el8.noarch perl-threads-1:2.21-2.el8.x86_64 perl-threads-shared-1.58-2.el8.x86_64 pixman-0.38.4-3.el8_9.x86_64 platform-python-devel-3.6.8-56.el8_9.3.x86_64 platform-python-pip-9.0.3-23.el8_9.1.noarch python3-pip-9.0.3-23.el8_9.1.noarch python3-rpm-generators-5-8.el8.noarch python3-setuptools-39.2.0-7.el8.noarch python36-3.6.8-38.module+el8.9.0+20976+d3c38525.x86_64 python36-devel-3.6.8-38.module+el8.9.0+20976+d3c38525.x86_64 python36-rpm-macros-3.6.8-38.module+el8.9.0+20976+d3c38525.noarch shared-mime-info-1.9-3.el8.x86_64 urw-base35-bookman-fonts-20170801-10.el8.noarch urw-base35-c059-fonts-20170801-10.el8.noarch urw-base35-d050000l-fonts-20170801-10.el8.noarch urw-base35-fonts-20170801-10.el8.noarch urw-base35-fonts-common-20170801-10.el8.noarch urw-base35-gothic-fonts-20170801-10.el8.noarch 
urw-base35-nimbus-mono-ps-fonts-20170801-10.el8.noarch urw-base35-nimbus-roman-fonts-20170801-10.el8.noarch urw-base35-nimbus-sans-fonts-20170801-10.el8.noarch urw-base35-p052-fonts-20170801-10.el8.noarch urw-base35-standard-symbols-ps-fonts-20170801-10.el8.noarch urw-base35-z003-fonts-20170801-10.el8.noarch vim-filesystem-2:8.0.1763-19.el8_6.4.noarch xorg-x11-font-utils-1:7.5-41.el8.x86_64 xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarch xorg-x11-server-utils-7.7-27.el8.x86_64 Complete! Finish: build setup for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm Start: rpmbuild cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm sh: -c: line 0: unexpected EOF while looking for matching `"' sh: -c: line 1: syntax error: unexpected end of file Building target platforms: x86_64 Building for target x86_64 Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.dZ4JFP + umask 022 + cd /builddir/build/BUILD + cd /builddir/build/BUILD + rm -rf cutlass + /usr/bin/mkdir -p cutlass + cd cutlass + /usr/bin/chmod -Rf a+rX,u+w,g-w,o-w . + git clone --depth 1 -n -b v3.5.0 https://github.com/NVIDIA/cutlass.git . Cloning into '.'... + git reset --hard v3.5.0 HEAD is now at 7d49e6c Updates for CUTLASS 3.5.0 (#1468) + git log --format=fuller commit 7d49e6c7e2f8896c47f586706e67e1fb215529dc Author: Vijay Thakkar AuthorDate: Thu Apr 11 21:33:40 2024 -0400 Commit: GitHub CommitDate: Thu Apr 11 21:33:40 2024 -0400 Updates for CUTLASS 3.5.0 (#1468) + echo 'Patch #0 (cutlass-fp16.patch):' Patch #0 (cutlass-fp16.patch): + /usr/bin/patch --no-backup-if-mismatch -p0 -b --suffix .fp16~ --fuzz=100 patching file include/cutlass/functional.h Hunk #1 succeeded at 217 with fuzz 3 (offset 128 lines). 
+ sed -i /-rpath/d CMakeLists.txt + exit 0 Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.zLMmi6 + umask 022 + cd /builddir/build/BUILD + cd cutlass + mkdir -p build + pushd build ~/build/BUILD/cutlass/build ~/build/BUILD/cutlass + export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64/ + LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64/ + CFLAGS= + export CFLAGS + CXXFLAGS= + export CXXFLAGS + FFLAGS=' -I/usr/lib64/gfortran/modules' + export FFLAGS + FCFLAGS=' -I/usr/lib64/gfortran/modules' + export FCFLAGS + LDFLAGS='-Wl,-z,relro ' + export LDFLAGS + /usr/bin/cmake -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_INSTALL_PREFIX:PATH=/usr -DINCLUDE_INSTALL_DIR:PATH=/usr/include -DLIB_INSTALL_DIR:PATH=/usr/lib64 -DSYSCONF_INSTALL_DIR:PATH=/etc -DSHARE_INSTALL_PREFIX:PATH=/usr/share -DLIB_SUFFIX=64 -DBUILD_SHARED_LIBS:BOOL=ON .. -DCMAKE_SKIP_RPATH=ON -DCMAKE_VERBOSE_MAKEFILE=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXE_LINKER_FLAGS=/usr/lib64/libstdc++.so.6 -DBUILD_TESTING=OFF -DCUTLASS_ENABLE_TESTS=OFF -DCUTLASS_ENABLE_PROFILER=ON -DCUTLASS_ENABLE_EXAMPLES=OFF -DCUDA_PROPAGATE_HOST_FLAGS=OFF -DCUTLASS_NVCC_EMBED_PTX=ON -DCUTLASS_NVCC_EMBED_CUBIN=ON '-DCUTLASS_NVCC_ARCHS=52;61;75;86;89;90' '-DCMAKE_CUDA_FLAGS=-Wl,--no-relax -Xfatbin=-compress-all --compiler-options -fPIC -Wno-deprecated-gpu-targets -allow-unsupported-compiler -D_SERIALIZE_H_INCLUDED' -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc -- CMake Version: 3.26.5 -- CUTLASS 3.5.0 -- The CXX compiler identification is GNU 8.5.0 -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /usr/bin/c++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- The CUDA compiler identification is NVIDIA 12.4.131 -- Detecting CUDA compiler ABI info -- Detecting CUDA compiler ABI info - done -- Check for working CUDA 
compiler: /usr/local/cuda-12.4/bin/nvcc - skipped -- Detecting CUDA compile features -- Detecting CUDA compile features - done -- CUDART: /usr/local/cuda-12.4/lib64/libcudart.so -- CUDA Driver: /usr/local/cuda-12.4/lib64/stubs/libcuda.so -- NVRTC: /usr/local/cuda-12.4/lib64/libnvrtc.so -- Default Install Location: /usr -- Found Python3: /usr/bin/python3.6 (found suitable version "3.6.8", minimum required is "3.5") found components: Interpreter CMake Warning at CMakeLists.txt:156 (message): Using unsupported or deprecated compute capabilities 52;61. Support may be removed in future versions. -- CUDA Compilation Architectures: 52;61;75;86;89;90 -- Enable caching of reference results in conv unit tests -- Enable rigorous conv problem sizes in conv unit tests -- Using NVCC flags: --expt-relaxed-constexpr;-DCUTLASS_TEST_LEVEL=0;-DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1;-DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1;-DCUTLASS_DEBUG_TRACE_LEVEL=0;-Xcompiler=-Wconversion;-Xcompiler=-fno-strict-aliasing -- CUTLASS Revision: 7d49e6c -- Configuring cublas ... -- cuBLAS Disabled. -- Configuring cuBLAS ... done. -- Completed generation of library instances. See /builddir/build/BUILD/cutlass/build/tools/library/library_instance_generation.log for more information. 
-- Configuring done (3.0s) -- Generating done (1.0s) CMake Warning: Manually-specified variables were not used by the project: CMAKE_C_FLAGS_RELEASE CMAKE_Fortran_FLAGS_RELEASE CUDA_PROPAGATE_HOST_FLAGS INCLUDE_INSTALL_DIR LIB_INSTALL_DIR LIB_SUFFIX SHARE_INSTALL_PREFIX SYSCONF_INSTALL_DIR -- Build files have been written to: /builddir/build/BUILD/cutlass/build + make -j4 [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/all_sm90_z1684symm_symm_operations.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/all_sm50_cgemm_gemm_operations.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/all_sm50_dgemm_gemm_operations.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/handle.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_nn_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_nn_align1.cu.o [ 0%] Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/src/manifest.cpp.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/operation_table.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/singleton.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/util.cu.o [ 0%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_nt_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_nt_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int4.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_tt_align1.cu.o [ 1%] Built target cutlass_library_symm_sm90_z1684symm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/all_sm50_sgemm_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_tt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_nn_align1.cu.o [ 1%] 
Built target cutlass_library_gemm_sm50_dgemm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/all_sm60_hgemm_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_nn_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm50_cgemm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/all_sm61_igemm_s8_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_nt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_canonical.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_nn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_nt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_nt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_tt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_tn_align1.cu.o [ 1%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_tn_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm50_sgemm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/all_sm61_s8_igemm_s8_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_tt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_nn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_tt_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm61_igemm_s8_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/all_sm70_f16_s884gemm_f16_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_nt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_nn_align8.cu.o [ 1%] Built target cutlass_library_gemm_sm60_hgemm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/all_sm70_f16_s884gemm_planar_complex_array_f16_gemm_operations.cu.o [ 1%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nn_align8.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_nt_align8.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_cn_align8.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_tt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_tn_align8.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_tt_align8.cu.o [ 2%] Built target cutlass_library_gemm_sm61_s8_igemm_s8_objs [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/all_sm70_f16_s884gemm_planar_complex_f16_gemm_operations.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_cc_align8.cu.o [ 2%] Built target cutlass_library_gemm_sm70_f16_s884gemm_f16_objs [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/all_sm70_h884gemm_gemm_operations.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_nn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_cn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nt_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_nt_align8.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_ct_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_tn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_cc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_interleaved_32.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nh_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_tt_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nt_align8.cu.o [ 2%] Built target cutlass_library_gemm_sm70_h884gemm_objs [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/all_sm70_h884gemm_planar_complex_gemm_operations.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_ch_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_ct_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_interleaved_64.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_tn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nh_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_cn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_hn_align8.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_ch_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_tc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_tn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_cc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_hn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_hc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nt_align8.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e4m3a_e4m3out.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_tc_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_ct_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_tt_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_hc_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nh_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_ht_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_tt_align8.cu.o [ 3%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_ch_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_th_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_hh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_hn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_th_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/all_sm70_h884gemm_planar_complex_array_gemm_operations.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_tc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_hh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_hc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_cn_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/all_sm70_s884gemm_f16_gemm_operations.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_nn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_tt_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_cc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_th_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_tt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_hh_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_s884gemm_f16_objs [ 4%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/all_sm70_s884gemm_planar_complex_array_f16_gemm_operations.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_ct_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nn_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/all_sm70_s884gemm_planar_complex_f16_gemm_operations.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_cn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_cn_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_ch_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_cc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_cc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nt_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_hn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_ct_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_ct_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_tc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_hc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_ch_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_ch_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_tt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_hn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_hn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_tc_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_tc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_th_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_hc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_hc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_hh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_tt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_tt_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/all_sm75_f16_s1688gemm_f16_gemm_operations.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_nn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_th_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_nt_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_th_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_hh_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_tn_align8.cu.o [ 5%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_hh_align8.cu.o [ 5%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/all_sm75_f16_s1688gemm_planar_complex_array_f16_gemm_operations.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_tt_align8.cu.o [ 5%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/all_sm75_f16_s1688gemm_planar_complex_f16_gemm_operations.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nn_align8.cu.o [ 5%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/all_sm75_h1688gemm_gemm_operations.cu.o [ 5%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_cn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_nn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_cn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_nt_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nc_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nc_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_tn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_cc_align8.cu.o [ 5%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_cc_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_tt_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nt_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nt_align8.cu.o [ 6%] Built target cutlass_library_gemm_sm75_h1688gemm_objs [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/all_sm75_h1688gemm_planar_complex_gemm_operations.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_ct_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_ct_align8.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nh_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_cn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nh_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_ch_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_ch_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_tn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_cc_align8.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_tn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_hn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nt_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_hn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_tc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_ct_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_tc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e5m2a_e4m3out.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_hc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nh_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_hc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_tt_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_ch_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_tt_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_ht_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_tn_align8.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_ht_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_th_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_hn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_hh_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_th_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_tc_align8.cu.o [ 7%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/all_sm75_h1688gemm_planar_complex_array_gemm_operations.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_hh_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_hc_align8.cu.o [ 7%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i88128xorgemm_b1_objs.dir/generated/gemm/75/i88128xorgemm_b1/all_sm75_i88128xorgemm_b1_gemm_operations.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_cn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_tt_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i88128xorgemm_b1_objs.dir/generated/gemm/75/i88128xorgemm_b1/cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128.cu.o [ 7%] Built target cutlass_library_gemm_sm75_i88128xorgemm_b1_objs [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_s8_objs.dir/generated/gemm/75/i8816gemm_s8/all_sm75_i8816gemm_s8_gemm_operations.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_ht_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_s8_objs.dir/generated/gemm/75/i8816gemm_s8/cutlass_tensorop_i8816gemm_s8_256x128_64x2_tn_align16.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_th_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_cc_align8.cu.o [ 7%] Built target cutlass_library_gemm_sm75_i8816gemm_s8_objs [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_u8_objs.dir/generated/gemm/75/i8816gemm_u8/all_sm75_i8816gemm_u8_gemm_operations.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_u8_objs.dir/generated/gemm/75/i8816gemm_u8/cutlass_tensorop_i8816gemm_u8_256x128_64x2_tn_align16.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_hh_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nt_align8.cu.o [ 8%] Built 
target cutlass_library_gemm_sm75_i8816gemm_u8_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_s4_objs.dir/generated/gemm/75/i8832gemm_s4/all_sm75_i8832gemm_s4_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_s4_objs.dir/generated/gemm/75/i8832gemm_s4/cutlass_tensorop_i8832gemm_s4_256x128_128x2_tn_align32.cu.o [ 8%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_u4_objs.dir/generated/gemm/75/i8832gemm_u4/all_sm75_i8832gemm_u4_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_u4_objs.dir/generated/gemm/75/i8832gemm_u4/cutlass_tensorop_i8832gemm_u4_256x128_128x2_tn_align32.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_ct_align8.cu.o [ 8%] Built target cutlass_library_gemm_sm75_i8832gemm_s4_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/all_sm75_s1688gemm_f16_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_nn_align8.cu.o [ 8%] Built target cutlass_library_gemm_sm75_i8832gemm_u4_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/all_sm75_s1688gemm_planar_complex_array_f16_gemm_operations.cu.o [ 8%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nh_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_cn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_ch_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_tn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nc_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_tn_align8.cu.o [ 8%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_tt_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_cc_align8.cu.o [ 8%] Built target cutlass_library_gemm_sm75_s1688gemm_f16_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/all_sm75_s1688gemm_planar_complex_f16_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_hn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nt_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_cn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_tc_align8.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_ct_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nh_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_hc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_cc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_ch_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_tt_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nt_align8.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_tn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_ct_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_ht_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_hn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nh_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_th_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_tc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_ch_align8.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_hc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_hh_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_tn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_tt_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_hn_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs.dir/generated/gemm/75/s4_i8832gemm_s4/all_sm75_s4_i8832gemm_s4_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs.dir/generated/gemm/75/s4_i8832gemm_s4/cutlass_tensorop_s4_i8832gemm_s4_256x128_128x2_tn_align32.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_ht_align8.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_tc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs.dir/generated/gemm/75/s4_i8832gemm_s4/cutlass_tensorop_s4_i8832gemm_s4_256x128_128x2_n64t64_align32.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_hc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_th_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs.dir/generated/gemm/75/s8_i8816gemm_s8/all_sm75_s8_i8816gemm_s8_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_tt_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs.dir/generated/gemm/75/s8_i8816gemm_s8/cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_tn_align16.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_hh_align8.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_ht_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs.dir/generated/gemm/75/s8_i8816gemm_s8/cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_n32t32_align16.cu.o [ 9%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs.dir/generated/gemm/75/u4_i8832gemm_u4/all_sm75_u4_i8832gemm_u4_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs.dir/generated/gemm/75/u4_i8832gemm_u4/cutlass_tensorop_u4_i8832gemm_u4_256x128_128x2_tn_align32.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_th_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs.dir/generated/gemm/75/u8_i8816gemm_u8/all_sm75_u8_i8816gemm_u8_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs.dir/generated/gemm/75/u8_i8816gemm_u8/cutlass_tensorop_u8_i8816gemm_u8_256x128_64x2_tn_align16.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs.dir/generated/gemm/75/u4_i8832gemm_u4/cutlass_tensorop_u4_i8832gemm_u4_256x128_128x2_n64t64_align32.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_hh_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e4m3a_e5m2out.cu.o [ 9%] Built target cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/all_sm80_bf16_s16816gemm_bf16_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs.dir/generated/gemm/75/u8_i8816gemm_u8/cutlass_tensorop_u8_i8816gemm_u8_256x128_64x2_n32t32_align16.cu.o [ 9%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_s8/all_sm80_bf16_s16816gemm_bf16_s8_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_nn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_s8/cutlass_tensorop_bf16_s16816gemm_bf16_s8_128x128_64x4_tn_align16.cu.o [ 9%] Built target cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_u8/all_sm80_bf16_s16816gemm_bf16_u8_gemm_operations.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_u8/cutlass_tensorop_bf16_s16816gemm_bf16_u8_128x128_64x4_tn_align16.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_nt_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/all_sm80_bf16_s16816gemm_planar_complex_array_bf16_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_tn_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/all_sm80_bf16_s16816gemm_planar_complex_bf16_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nn_align8.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_cn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_tt_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_cn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nc_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_s8_bf16/all_sm80_bf16_s16816gemm_s8_bf16_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_s8_bf16/cutlass_tensorop_bf16_s16816gemm_s8_bf16_128x128_64x4_tn_align16.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_cc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_cc_align8.cu.o [ 10%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_u8_bf16/all_sm80_bf16_s16816gemm_u8_bf16_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_u8_bf16/cutlass_tensorop_bf16_s16816gemm_u8_bf16_128x128_64x4_tn_align16.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_ct_align8.cu.o [ 10%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_objs [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/all_sm80_bf16_s16832spgemm_bf16_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_ct_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_nn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nh_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nh_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_nt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_ch_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_ch_align8.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_tn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_tn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_tn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_hn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_tt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_hn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_tc_align8.cu.o [ 10%] Built target cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/all_sm80_c1688gemm_gemm_operations.cu.o [ 
10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_tc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_hc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_hc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_cn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_tt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_tt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nc_align1.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_ht_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_ht_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_cc_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_th_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_th_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_hh_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nt_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_hh_align8.cu.o [ 10%] Built target 
cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/all_sm80_c1688tf32gemm_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_ct_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nn_align1.cu.o [ 10%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/all_sm80_cgemm_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nh_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_cn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_cn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_ch_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nc_align1.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nc_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_tn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_cc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_hn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_cc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_tc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_ct_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_hc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nh_align1.cu.o [ 11%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_ct_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_tt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_ch_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nh_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_ht_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_tn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_th_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_ch_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_hn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_hh_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_tc_align1.cu.o [ 11%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_tn_align1.cu.o [ 11%] Built target cutlass_library_gemm_sm80_c1688gemm_objs [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_hn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e5m2a_e5m2out.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_hc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_tc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_hc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_tt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_tt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_ht_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_ht_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_th_align1.cu.o [ 11%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_th_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_hh_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_hh_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_fp16out.cu.o [ 11%] Built target cutlass_library_gemm_sm80_cgemm_objs [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/all_sm80_d884gemm_gemm_operations.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_nn_align1.cu.o [ 11%] Built target cutlass_library_gemm_sm80_c1688tf32gemm_objs [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/all_sm80_dgemm_gemm_operations.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_nn_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_nt_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_nt_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_tn_align1.cu.o [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_tn_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_tt_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_tt_align1.cu.o [ 12%] Built target cutlass_library_gemm_sm80_d884gemm_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/all_sm80_f16_s16816gemm_f16_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_nn_align8.cu.o [ 12%] Built target cutlass_library_gemm_sm80_dgemm_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_s8/all_sm80_f16_s16816gemm_f16_s8_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_s8/cutlass_tensorop_f16_s16816gemm_f16_s8_128x128_64x4_tn_align16.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_nt_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_tn_align8.cu.o [ 12%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_objs [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_u8/all_sm80_f16_s16816gemm_f16_u8_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_u8/cutlass_tensorop_f16_s16816gemm_f16_u8_128x128_64x4_tn_align16.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_tt_align8.cu.o [ 12%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/all_sm80_f16_s16816gemm_planar_complex_array_f16_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nn_align8.cu.o [ 12%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/all_sm80_f16_s16816gemm_planar_complex_f16_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nn_align8.cu.o [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_cn_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_cn_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nc_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nc_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_cc_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_cc_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nt_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_bf16out.cu.o [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nt_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_ct_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_ct_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nh_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nh_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_ch_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_ch_align8.cu.o [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_tn_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_tn_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_hn_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_hn_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_tc_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_tc_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_hc_align8.cu.o [ 13%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_hc_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_tt_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_tt_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_ht_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_ht_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_th_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_th_align8.cu.o [ 13%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_hh_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_hh_align8.cu.o [ 13%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_s8_f16/all_sm80_f16_s16816gemm_s8_f16_gemm_operations.cu.o [ 13%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_u8_f16/all_sm80_f16_s16816gemm_u8_f16_gemm_operations.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_s8_f16/cutlass_tensorop_f16_s16816gemm_s8_f16_128x128_64x4_tn_align16.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_u8_f16/cutlass_tensorop_f16_s16816gemm_u8_f16_128x128_64x4_tn_align16.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_fp32out.cu.o [ 13%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_objs [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/all_sm80_f16_s16832spgemm_f16_gemm_operations.cu.o [ 14%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_objs [ 14%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/all_sm80_gz884gemm_gemm_operations.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_nn_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nn_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp32out.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_cn_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_nt_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nc_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_tn_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_cc_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_tt_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nt_align1.cu.o [ 14%] 
Built target cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/all_sm80_h16816gemm_gemm_operations.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_ct_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_nn_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nh_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_nt_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_ch_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_tn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_tn_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_tt_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_hn_align1.cu.o [ 15%] Built target cutlass_library_gemm_sm80_h16816gemm_objs [ 15%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/all_sm80_h16816gemm_grouped_gemm_operations.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_nn_align8_scheduleDevice.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_tc_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_hc_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_nt_align8_scheduleDevice.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_tt_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_tn_align8_scheduleDevice.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_ht_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_tt_align8_scheduleDevice.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp_other.cu.o [ 15%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_th_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp_mixed_input.cu.o [ 15%] Built target cutlass_library_gemm_sm80_h16816gemm_grouped_objs [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/all_sm80_h16816gemm_planar_complex_gemm_operations.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_hh_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_cn_align8.cu.o [ 15%] Built target cutlass_library_gemm_sm80_gz884gemm_objs [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/all_sm80_h16816gemm_planar_complex_array_gemm_operations.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nc_align8.cu.o [ 15%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_cn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_cc_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nc_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nt_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_cc_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_ct_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nt_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nh_align8.cu.o [ 15%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_ct_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_ch_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nh_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_tn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_ch_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_hn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_tn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_tc_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/initialize_reference_operations.cu.o [ 
15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_hn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_hc_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/reduction_device.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/init_reduction_operations.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_tc_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv2d.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_tt_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_hc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_ht_align8.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_tt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_th_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv3d.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_ht_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_hh_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_th_align8.cu.o [ 16%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_s8_f16_objs.dir/generated/gemm/80/h16816gemm_s8_f16/all_sm80_h16816gemm_s8_f16_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_s8_f16_objs.dir/generated/gemm/80/h16816gemm_s8_f16/cutlass_tensorop_h16816gemm_s8_f16_128x128_64x4_tn_align16.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_hh_align8.cu.o [ 16%] 
Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/initialize_all.cpp.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/all_gemm_operations.cu.o [ 17%] Built target cutlass_library_gemm_sm80_h16816gemm_s8_f16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/all_sm80_h16832spgemm_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/all_conv2d_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_nn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168128spgemm_s4_objs.dir/generated/gemm/80/i168128spgemm_s4/all_sm80_i168128spgemm_s4_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv3d/all_conv3d_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168128spgemm_s4_objs.dir/generated/gemm/80/i168128spgemm_s4/cutlass_tensorop_i168128spgemm_s4_64x64_256x4_tn_align32.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/rank_k/all_rank_k_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/rank_2k/all_rank_2k_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_nt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/trmm/all_trmm_operations.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i168128spgemm_s4_objs [ 17%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256andgemm_b1_objs.dir/generated/gemm/80/i168256andgemm_b1/all_sm80_i168256andgemm_b1_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/symm/all_symm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256andgemm_b1_objs.dir/generated/gemm/80/i168256andgemm_b1/cutlass_tensorop_i168256andgemm_b1_256x128_512x3_tn_align128.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256xorgemm_b1_objs.dir/generated/gemm/80/i168256xorgemm_b1/all_sm80_i168256xorgemm_b1_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_tn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256xorgemm_b1_objs.dir/generated/gemm/80/i168256xorgemm_b1/cutlass_tensorop_i168256xorgemm_b1_256x128_512x3_tn_align128.cu.o [ 17%] Built target cutlass_library_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_s8_objs.dir/generated/gemm/80/i16832gemm_s8/all_sm80_i16832gemm_s8_gemm_operations.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i168256andgemm_b1_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_u8_objs.dir/generated/gemm/80/i16832gemm_u8/all_sm80_i16832gemm_u8_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_s8_objs.dir/generated/gemm/80/i16832gemm_s8/cutlass_tensorop_i16832gemm_s8_256x128_64x3_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_u8_objs.dir/generated/gemm/80/i16832gemm_u8/cutlass_tensorop_i16832gemm_u8_256x128_64x3_tn_align16.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_tt_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i168256xorgemm_b1_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_s4_objs.dir/generated/gemm/80/i16864gemm_s4/all_sm80_i16864gemm_s4_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_s4_objs.dir/generated/gemm/80/i16864gemm_s4/cutlass_tensorop_i16864gemm_s4_256x128_128x3_tn_align32.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i16832gemm_s8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_u4_objs.dir/generated/gemm/80/i16864gemm_u4/all_sm80_i16864gemm_u4_gemm_operations.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i16832gemm_u8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864spgemm_s8_objs.dir/generated/gemm/80/i16864spgemm_s8/all_sm80_i16864spgemm_s8_gemm_operations.cu.o [ 17%] Built target cutlass_library_gemm_sm80_h16832spgemm_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/all_sm80_s16816gemm_bf16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_u4_objs.dir/generated/gemm/80/i16864gemm_u4/cutlass_tensorop_i16864gemm_u4_256x128_128x3_tn_align32.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864spgemm_s8_objs.dir/generated/gemm/80/i16864spgemm_s8/cutlass_tensorop_i16864spgemm_s8_128x64_128x3_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_nn_align8.cu.o [ 17%] Built target 
cutlass_library_gemm_sm80_i16864gemm_s4_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/s16816gemm_bf16_s8/all_sm80_s16816gemm_bf16_s8_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/s16816gemm_bf16_s8/cutlass_tensorop_s16816gemm_bf16_s8_128x128_64x4_tn_align16.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i16864gemm_u4_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/s16816gemm_bf16_u8/all_sm80_s16816gemm_bf16_u8_gemm_operations.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i16864spgemm_s8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/all_sm80_s16816gemm_f16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_nt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/s16816gemm_bf16_u8/cutlass_tensorop_s16816gemm_bf16_u8_128x128_64x4_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_nn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_s8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_s8_objs.dir/generated/gemm/80/s16816gemm_f16_s8/all_sm80_s16816gemm_f16_s8_gemm_operations.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_s8_objs.dir/generated/gemm/80/s16816gemm_f16_s8/cutlass_tensorop_s16816gemm_f16_s8_128x128_64x4_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_tn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_nt_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_u8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_u8_objs.dir/generated/gemm/80/s16816gemm_f16_u8/all_sm80_s16816gemm_f16_u8_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_u8_objs.dir/generated/gemm/80/s16816gemm_f16_u8/cutlass_tensorop_s16816gemm_f16_u8_128x128_64x4_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_tt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_tn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_s8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/all_sm80_s16816gemm_grouped_bf16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_nn_align8_scheduleDevice.cu.o [ 17%] Built target 
cutlass_library_gemm_sm80_s16816gemm_f16_u8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/all_sm80_s16816gemm_grouped_f16_gemm_operations.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/all_sm80_s16816gemm_planar_complex_array_bf16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_nn_align8_scheduleDevice.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_tt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_nt_align8_scheduleDevice.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_nt_align8_scheduleDevice.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_objs [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/all_sm80_s16816gemm_planar_complex_array_f16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_cn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_tn_align8_scheduleDevice.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_tn_align8_scheduleDevice.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_cn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_tt_align8_scheduleDevice.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_tt_align8_scheduleDevice.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_cc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nc_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/all_sm80_s16816gemm_planar_complex_bf16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/all_sm80_s16816gemm_planar_complex_f16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nt_align8.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_cc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_cn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_ct_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_cn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nc_align8.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nh_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_ct_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_cc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_cc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nh_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_ch_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nt_align8.cu.o [ 17%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_ch_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_tn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_ct_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_ct_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_tn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_hn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nh_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nh_align8.cu.o [ 17%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_hn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_tc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_ch_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_ch_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_tc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_hc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_tn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_tn_align8.cu.o [ 17%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_hc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_tt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_hn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_hn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_tt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_ht_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_tc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_tc_align8.cu.o [ 
17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_ht_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_th_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_hc_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_hc_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_th_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_hh_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_tt_align8.cu.o [ 19%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_tt_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_hh_align8.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/s16816gemm_s8_bf16/all_sm80_s16816gemm_s8_bf16_gemm_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_ht_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/s16816gemm_s8_bf16/cutlass_tensorop_s16816gemm_s8_bf16_128x128_64x4_tn_align16.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_ht_align8.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_f16_objs.dir/generated/gemm/80/s16816gemm_s8_f16/all_sm80_s16816gemm_s8_f16_gemm_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_th_align8.cu.o [ 19%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_f16_objs.dir/generated/gemm/80/s16816gemm_s8_f16/cutlass_tensorop_s16816gemm_s8_f16_128x128_64x4_tn_align16.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_th_align8.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_bf16_objs [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/s16816gemm_u8_bf16/all_sm80_s16816gemm_u8_bf16_gemm_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/s16816gemm_u8_bf16/cutlass_tensorop_s16816gemm_u8_bf16_128x128_64x4_tn_align16.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_hh_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_hh_align8.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_f16_objs [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cfprop_optimized_cf32/all_sm75_cf32_cfprop_optimized_cf32_conv2d_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cfprop_optimized_cf32/cutlass_simt_cf32_cfprop_optimized_cf32_128x128_8x5_nhwc_align1.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_bf16_objs [ 20%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_f16_objs.dir/generated/gemm/80/s16816gemm_u8_f16/all_sm80_s16816gemm_u8_f16_gemm_operations.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/all_sm80_s16816tf32spgemm_gemm_operations.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/all_sm80_s16832spgemm_bf16_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_f16_objs.dir/generated/gemm/80/s16816gemm_u8_f16/cutlass_tensorop_s16816gemm_u8_f16_128x128_64x4_tn_align16.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_nn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_nn_align8.cu.o [ 20%] Built target cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/all_sm80_s16832spgemm_f16_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_nn_align8.cu.o [ 20%] Built 
target cutlass_library_gemm_sm80_s16816gemm_u8_f16_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/all_sm80_s1688bf16gemm_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_nt_align8.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_nn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_tn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_nt_align8.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_tn_align8.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_tt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_tn_align8.cu.o [ 20%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_tt_align8.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_tn_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s16816tf32spgemm_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/all_sm80_s1688f16gemm_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_tt_align8.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_nn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_tt_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s16832spgemm_bf16_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/all_sm80_s1688gemm_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_nn_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s16832spgemm_f16_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/all_sm80_s1688gemm_tf32_gemm_operations.cu.o [ 20%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_nt_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s1688bf16gemm_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_nn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_tn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_tn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_tt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_tn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_tt_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_tt_align4.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s1688gemm_tf32_objs 
[ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/all_sm80_s1688tf32gemm_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_nn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_nt_align4.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s1688f16gemm_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i168128spgemm_s4_objs.dir/generated/gemm/80/s4_i168128spgemm_s4/all_sm80_s4_i168128spgemm_s4_gemm_operations.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s1688gemm_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs.dir/generated/gemm/80/s4_i16864gemm_s4/all_sm80_s4_i16864gemm_s4_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i168128spgemm_s4_objs.dir/generated/gemm/80/s4_i168128spgemm_s4/cutlass_tensorop_s4_i168128spgemm_s4_64x64_256x4_tn_align32.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_tn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs.dir/generated/gemm/80/s4_i16864gemm_s4/cutlass_tensorop_s4_i16864gemm_s4_256x128_128x3_tn_align32.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_tt_align4.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s4_i168128spgemm_s4_objs [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs.dir/generated/gemm/80/s8_i16832gemm_s8/all_sm80_s8_i16832gemm_s8_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs.dir/generated/gemm/80/s8_i16832gemm_s8/cutlass_tensorop_s8_i16832gemm_s8_256x128_64x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs.dir/generated/gemm/80/s4_i16864gemm_s4/cutlass_tensorop_s4_i16864gemm_s4_256x128_128x3_n64t64_align32.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs.dir/generated/gemm/80/s8_i16832gemm_s8/cutlass_tensorop_s8_i16832gemm_s8_256x128_64x3_n32t32_align16.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s1688tf32gemm_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16864spgemm_s8_objs.dir/generated/gemm/80/s8_i16864spgemm_s8/all_sm80_s8_i16864spgemm_s8_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16864spgemm_s8_objs.dir/generated/gemm/80/s8_i16864spgemm_s8/cutlass_tensorop_s8_i16864spgemm_s8_128x64_128x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/all_sm80_sgemm_gemm_operations.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/all_sm80_tf32_s1688gemm_tf32_gemm_operations.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs.dir/generated/gemm/80/u4_i16864gemm_u4/all_sm80_u4_i16864gemm_u4_gemm_operations.cu.o [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_nn_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_nn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs.dir/generated/gemm/80/u4_i16864gemm_u4/cutlass_tensorop_u4_i16864gemm_u4_256x128_128x3_tn_align32.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s8_i16864spgemm_s8_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs.dir/generated/gemm/80/u8_i16832gemm_u8/all_sm80_u8_i16832gemm_u8_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs.dir/generated/gemm/80/u8_i16832gemm_u8/cutlass_tensorop_u8_i16832gemm_u8_256x128_64x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs.dir/generated/gemm/80/u4_i16864gemm_u4/cutlass_tensorop_u4_i16864gemm_u4_256x128_128x3_n64t64_align32.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_nt_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_nt_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs.dir/generated/gemm/80/u8_i16832gemm_u8/cutlass_tensorop_u8_i16832gemm_u8_256x128_64x3_n32t32_align16.cu.o [ 21%] Built target cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/all_sm80_z884gemm_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_tn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nn_align1.cu.o [ 21%] Built target cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3/all_sm89_s16832fastaccumgemm_e4m3_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3/cutlass_tensorop_s16832fastaccumgemm_e4m3_256x128_64x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_tn_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_cn_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_tt_align4.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3_e5m2/all_sm89_s16832fastaccumgemm_e4m3_e5m2_gemm_operations.cu.o [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3_e5m2/cutlass_tensorop_s16832fastaccumgemm_e4m3_e5m2_256x128_64x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nc_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_tt_align1.cu.o [ 21%] Built target cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2/all_sm89_s16832fastaccumgemm_e5m2_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2/cutlass_tensorop_s16832fastaccumgemm_e5m2_256x128_64x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_cc_align1.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2_e4m3/all_sm89_s16832fastaccumgemm_e5m2_e4m3_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2_e4m3/cutlass_tensorop_s16832fastaccumgemm_e5m2_e4m3_256x128_64x3_tn_align16.cu.o [ 21%] Built target cutlass_library_gemm_sm80_sgemm_objs [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_objs.dir/generated/gemm/89/s16832gemm_e4m3/all_sm89_s16832gemm_e4m3_gemm_operations.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832gemm_e4m3_e5m2/all_sm89_s16832gemm_e4m3_e5m2_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nt_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_objs.dir/generated/gemm/89/s16832gemm_e4m3/cutlass_tensorop_s16832gemm_e4m3_256x128_64x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832gemm_e4m3_e5m2/cutlass_tensorop_s16832gemm_e4m3_e5m2_256x128_64x3_tn_align16.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_objs.dir/generated/gemm/89/s16832gemm_e5m2/all_sm89_s16832gemm_e5m2_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_objs.dir/generated/gemm/89/s16832gemm_e5m2/cutlass_tensorop_s16832gemm_e5m2_256x128_64x3_tn_align16.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_ct_align1.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832gemm_e5m2_e4m3/all_sm89_s16832gemm_e5m2_e4m3_gemm_operations.cu.o [ 22%] Built target 
cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3/all_sm89_s16864fastaccumspgemm_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832gemm_e5m2_e4m3/cutlass_tensorop_s16832gemm_e5m2_e4m3_256x128_64x3_tn_align16.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3/cutlass_tensorop_s16864fastaccumspgemm_e4m3_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3_e5m2/all_sm89_s16864fastaccumspgemm_e4m3_e5m2_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nh_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3_e5m2/cutlass_tensorop_s16864fastaccumspgemm_e4m3_e5m2_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2/all_sm89_s16864fastaccumspgemm_e5m2_gemm_operations.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_objs [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2_e4m3/all_sm89_s16864fastaccumspgemm_e5m2_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2/cutlass_tensorop_s16864fastaccumspgemm_e5m2_128x64_128x3_tn_align16.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2_e4m3/cutlass_tensorop_s16864fastaccumspgemm_e5m2_e4m3_128x64_128x3_tn_align16.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_ch_align1.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e4m3/all_sm89_s16864spgemm_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e4m3/cutlass_tensorop_s16864spgemm_e4m3_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e4m3_e5m2/all_sm89_s16864spgemm_e4m3_e5m2_gemm_operations.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e5m2/all_sm89_s16864spgemm_e5m2_gemm_operations.cu.o [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_tn_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e4m3_e5m2/cutlass_tensorop_s16864spgemm_e4m3_e5m2_128x64_128x3_tn_align16.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e5m2/cutlass_tensorop_s16864spgemm_e5m2_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e5m2_e4m3/all_sm89_s16864spgemm_e5m2_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e5m2_e4m3/cutlass_tensorop_s16864spgemm_e5m2_e4m3_128x64_128x3_tn_align16.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_hn_align1.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/all_sm90_bf16_s64x128x16gemm_bf16_gemm_operations.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/all_sm90_bf16_s64x128x32gemm_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_tc_align1.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/all_sm90_bf16_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_hc_align1.cu.o [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_tt_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_ht_align1.cu.o [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_th_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_hh_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 22%] Built target cutlass_library_gemm_sm80_z884gemm_objs [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/all_sm90_bf16_s64x128x32gemm_e5m2_gemm_operations.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; 
CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required 
from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: 
required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > 
>'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, 
long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> 
> >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = 
cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long 
int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, 
cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, 
cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; 
TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> 
> > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = 
void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const 
Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 25%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ 
= cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = 
cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; 
EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, 
cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = 
cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> 
>, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long 
int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long 
int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const 
Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = 
cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, 
void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, 
cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, 
cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ 
= cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, 
cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, 
cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o 
[ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = 
cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, 
cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = 
false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> 
>, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, 
long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> 
>, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, 
cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using 
TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> 
>, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, 
cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 
'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool 
DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; 
EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/all_sm90_bf16_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/all_sm90_d1684gemm_gemm_operations.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_nnn_align1.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/all_sm90_f16_s64x128x16gemm_f16_gemm_operations.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_ntn_align1.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_tnn_align1.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; 
ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long 
int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_ttn_align1.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 30%] Built target cutlass_library_gemm_sm90_d1684gemm_objs [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/all_sm90_f16_s64x128x32gemm_e4m3_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = 
cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long 
int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, 
cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 30%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/all_sm90_f16_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, 
void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, 
void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, 
void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8.cu.o [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > 
>, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ 
= cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = 
cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > 
>'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long 
int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with 
ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, 
void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> 
>, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: 
required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) 
[with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, 
void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const 
void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) 
[with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, 
void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const 
[with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, 
cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, 
cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, 
cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long 
int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long 
int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > 
>, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: 
note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = 
cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long 
int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : 
Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = 
cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not 
initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, 
cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required 
from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with 
GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with 
Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with 
Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, 
void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> 
> > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, 
ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = 
struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required 
from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t 
= CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; 
EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, 
long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor 
struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 33%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> 
> > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 33%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 
0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ 
= cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, 
long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, 
cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, 
cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/all_sm90_f16_s64x128x32gemm_e5m2_gemm_operations.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Built target cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/all_sm90_f16_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 
36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 36%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/all_sm90_gz1684gemm_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_nnn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > 
>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_cnn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, 
long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long 
int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor 
does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = 
cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom 
^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ncn_align1.cu.o [ 36%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/all_sm90_h64x128x16gemm_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; 
bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ccn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, 
cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ntn_align1.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ctn_align1.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_nhn_align1.cu.o [ 36%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_chn_align1.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_tnn_align1.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_hnn_align1.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long 
int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_tcn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long 
int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_hcn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool 
ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > 
> >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 
4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; 
StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom 
^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ttn_align1.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_htn_align1.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_thn_align1.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_hhn_align1.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Built target cutlass_library_gemm_sm90_gz1684gemm_objs [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/all_sm90_i64x128x32gemm_s8_gemm_operations.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = 
cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > 
> >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, 
cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long 
int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long 
int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, 
long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long 
int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long 
int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: 
note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, 
long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long 
int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long 
int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: 
note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided 
default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o
[ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
[ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o
[ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 36%] Building CUDA object
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = 
cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long 
int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> 
>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default 
constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 36%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/all_sm90_i64x128x32gemm_u8_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D 
= struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t 
= CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; 
ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long 
int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom 
^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; 
StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = 
cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > 
>, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple 
> > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long 
int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, 
cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with 
ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = 
cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > 
> >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, 
cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long 
int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; 
CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 
'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static 
constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, 
void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 
'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t 
= CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, 
void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const 
[with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, 
cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = 
cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ 
= cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} 
has no user-provided default constructor
 struct TiledCopy : Copy_Atom
        ^~~~~~~~~
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]'
     cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
          ^
[ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o
[ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o
[ 37%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs
[ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/all_sm90_s64x128x16gemm_bf16_gemm_operations.cu.o
[ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8.cu.o
[ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required 
from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = 
cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, 
cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, 
cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const 
ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with 
ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided 
default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 
'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; 
ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, 
long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, 
void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, 
void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with 
ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > 
>, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, 
cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, 
cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const 
cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = 
cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, 
long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long 
int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: 
note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/all_sm90_s64x128x16gemm_f16_gemm_operations.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8.cu.o [ 40%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/all_sm90_s64x128x32gemm_e4m3_gemm_operations.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 
'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, 
cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; 
ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> 
>, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no 
user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; 
CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 
'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' 
{aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = 
false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> 
>, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 
0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const 
ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> 
> > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, 
cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int 
StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 
1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, 
long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 
42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Built target cutlass_library_gemm_sm90_h64x128x16gemm_objs [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/all_sm90_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, 
long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, 
long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, 
cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, 
void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; 
typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with 
Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = 
cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 
2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple 
>, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, 
cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, 
cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, 
cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const 
void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; 
CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; 
typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 43%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> 
> > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; 
ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, 
cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, 
cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool 
ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; 
CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ 
= float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long 
int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, 
long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor 
struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long 
int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, 
cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, 
cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: 
note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: 
required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 43%] Built target 
cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/all_sm90_s64x128x32gemm_e5m2_gemm_operations.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) 
[with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' 
{aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long 
int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, 
cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; 
ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no 
user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; 
StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> 
> >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > 
>&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 
'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, 
void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const 
Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 
45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ 
= cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, 
ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const 
Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 46%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/all_sm90_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, 
void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required 
from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/all_sm90_s64x128x8gemm_tf32_gemm_operations.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 
'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 
47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; 
CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> 
> >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; 
ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, 
cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, 
cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, 
long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} 
has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, 
cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, 
cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, 
const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long 
int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long 
int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, 
cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, 
cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; 
SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and 
the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = 
cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, 
cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, 
cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, 
long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, 
cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with 
ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, 
void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o
[ 48%]
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > 
>, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool 
ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; 
CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 
1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ 
= 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, 
cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: 
note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, 
cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, 
cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, 
void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: 
required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs [ 
48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/all_sm90_s64x128x8tf32gemm_gemm_operations.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not 
initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o 
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does 
not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long 
int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, 
cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > 
>, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, 
cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_cooperative.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; 
int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, 
cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_ttn_align4_warpspecialized_cooperative.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, 
void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, 
const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> 
> > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with 
ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, 
cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; 
StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, 
long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, 
cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align2_cpasync_warpspecialized.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 49%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align1_cpasync_warpspecialized.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, 
long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, 
long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, 
cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no 
user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = 
float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, 
cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 50%] Built target cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/all_sm90_s8_i64x128x32gemm_s8_gemm_operations.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, 
float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = signed char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/all_sm90_s8_i64x128x32gemm_u8_gemm_operations.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ 
= cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = signed char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = unsigned char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: 
required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = signed char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = unsigned char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, 
cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long 
int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, 
cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = unsigned char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_cooperative.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/all_sm90_void_i64x128x32gemm_s8_gemm_operations.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_ttn_align4_warpspecialized_cooperative.cu.o [ 51%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, 
cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, 
int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, 
int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long 
int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor 
struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/all_sm90_void_i64x128x32gemm_u8_gemm_operations.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, 
cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, 
cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, 
cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = 
struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const 
ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, 
cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 51%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; 
ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long 
int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, 
cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided 
default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align2_cpasync_warpspecialized.cu.o [ 52%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/all_sm90_void_s64x128x16gemm_bf16_gemm_operations.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/all_sm90_void_s64x128x16gemm_f16_gemm_operations.cu.o [ 52%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align1_cpasync_warpspecialized.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Built target cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/all_sm90_void_s64x128x32gemm_e4m3_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; 
CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, 
cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, 
cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/all_sm90_void_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool 
DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 
1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ 
= cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> 
>, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no 
user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; 
ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, 
long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, 
cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, 
cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not 
initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> 
> >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ 
= cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: 
required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/all_sm90_void_s64x128x32gemm_e5m2_gemm_operations.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: 
required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/all_sm90_void_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: 
required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; 
CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; 
typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, 
cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; 
bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; 
CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long 
int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > 
> >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, 
cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/all_sm90_z1684gemm_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor 
does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_nnn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_cnn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom 
^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > 
>, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cdgrad_optimized_cf32/all_sm50_cf32_cdgrad_optimized_cf32_conv2d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; 
CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x64_8x2_nhwc_unity_stride_align1.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ncn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long 
int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct 
TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ccn_align1.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x64_8x2_nhwc_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool 
DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const 
cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ntn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool 
DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ctn_align1.cu.o [ 52%] Built target cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cfprop_optimized_cf32/all_sm50_cf32_cfprop_optimized_cf32_conv2d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, 
cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cfprop_optimized_cf32/cutlass_simt_cf32_cfprop_optimized_cf32_128x64_8x2_nhwc_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_nhn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, 
cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, 
cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const 
cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 53%] Built target cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_objs [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cwgrad_optimized_cf32/all_sm50_cf32_cwgrad_optimized_cf32_conv2d_operations.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_chn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cwgrad_optimized_cf32/cutlass_simt_cf32_cwgrad_optimized_cf32_128x64_8x2_nhwc_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_tnn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, 
cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, 
cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> 
> >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 53%] Built target cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_objs [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sdgrad_optimized_objs.dir/generated/conv2d/50/sdgrad_optimized/all_sm50_sdgrad_optimized_conv2d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long 
int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long 
int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, 
cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sdgrad_optimized_objs.dir/generated/conv2d/50/sdgrad_optimized/cutlass_simt_sdgrad_optimized_128x128_8x2_nhwc_unity_stride_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_hnn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long 
int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, 
cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sdgrad_optimized_objs.dir/generated/conv2d/50/sdgrad_optimized/cutlass_simt_sdgrad_optimized_128x128_8x2_nhwc_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_tcn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long 
int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long 
int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 53%] Built target cutlass_library_conv2d_sm50_sdgrad_optimized_objs [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sfprop_optimized_objs.dir/generated/conv2d/50/sfprop_optimized/all_sm50_sfprop_optimized_conv2d_operations.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_hcn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long 
int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, 
cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sfprop_optimized_objs.dir/generated/conv2d/50/sfprop_optimized/cutlass_simt_sfprop_optimized_128x128_8x2_nhwc_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ttn_align1.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_sfprop_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_swgrad_optimized_objs.dir/generated/conv2d/50/swgrad_optimized/all_sm50_swgrad_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_swgrad_optimized_objs.dir/generated/conv2d/50/swgrad_optimized/cutlass_simt_swgrad_optimized_128x128_8x2_nhwc_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_htn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_swgrad_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm60_hfprop_optimized_objs.dir/generated/conv2d/60/hfprop_optimized/all_sm60_hfprop_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_thn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm60_hfprop_optimized_objs.dir/generated/conv2d/60/hfprop_optimized/cutlass_simt_hfprop_optimized_64x32x9_1x8x8x32_3_filter3x3_nhwc_depthwise_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, 
cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, 
cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, 
cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_hhn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 54%] Built target cutlass_library_conv2d_sm60_hfprop_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884dgrad_optimized_f16/all_sm70_f16_s884dgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884dgrad_optimized_f16/cutlass_tensorop_f16_s884dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 54%] Built target cutlass_library_gemm_sm90_z1684gemm_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/f16_s884fprop_optimized_f16/all_sm70_f16_s884fprop_optimized_f16_conv2d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, 
void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = 
void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884dgrad_optimized_f16/cutlass_tensorop_f16_s884dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/f16_s884fprop_optimized_f16/cutlass_tensorop_f16_s884fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static 
constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; 
typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884wgrad_optimized_f16/all_sm70_f16_s884wgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884dgrad_optimized_objs.dir/generated/conv2d/70/h884dgrad_optimized/all_sm70_h884dgrad_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884wgrad_optimized_f16/cutlass_tensorop_f16_s884wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884dgrad_optimized_objs.dir/generated/conv2d/70/h884dgrad_optimized/cutlass_tensorop_h884dgrad_optimized_256x128_32x2_nhwc_unity_stride_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_objs [ 54%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884fprop_optimized_objs.dir/generated/conv2d/70/h884fprop_optimized/all_sm70_h884fprop_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884dgrad_optimized_objs.dir/generated/conv2d/70/h884dgrad_optimized/cutlass_tensorop_h884dgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884fprop_optimized_objs.dir/generated/conv2d/70/h884fprop_optimized/cutlass_tensorop_h884fprop_optimized_256x128_32x2_nhwc_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_h884dgrad_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884wgrad_optimized_objs.dir/generated/conv2d/70/h884wgrad_optimized/all_sm70_h884wgrad_optimized_conv2d_operations.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_h884fprop_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/s884dgrad_optimized_f16/all_sm70_s884dgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884wgrad_optimized_objs.dir/generated/conv2d/70/h884wgrad_optimized/cutlass_tensorop_h884wgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/s884dgrad_optimized_f16/cutlass_tensorop_s884dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, 
const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_h884wgrad_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/s884fprop_optimized_f16/all_sm70_s884fprop_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/s884dgrad_optimized_f16/cutlass_tensorop_s884dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, 
cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const 
cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/s884fprop_optimized_f16/cutlass_tensorop_s884fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, 
long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, 
cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/s884wgrad_optimized_f16/all_sm70_s884wgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_s884fprop_optimized_f16_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cdgrad_optimized_cf32/all_sm75_cf32_cdgrad_optimized_cf32_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/s884wgrad_optimized_f16/cutlass_tensorop_s884wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x128_8x5_nhwc_unity_stride_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 
'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 55%] Built target cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cwgrad_optimized_cf32/all_sm75_cf32_cwgrad_optimized_cf32_conv2d_operations.cu.o [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cwgrad_optimized_cf32/cutlass_simt_cf32_cwgrad_optimized_cf32_128x128_8x5_nhwc_align1.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x128_8x5_nhwc_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 55%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688dgrad_optimized_f16/all_sm75_f16_s1688dgrad_optimized_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688dgrad_optimized_f16/cutlass_tensorop_f16_s1688dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 55%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_few_channels_f16/all_sm75_f16_s1688fprop_few_channels_f16_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_fixed_channels_f16/all_sm75_f16_s1688fprop_fixed_channels_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_few_channels_f16/cutlass_tensorop_f16_s1688fprop_few_channels_f16_128x64_32x2_nhwc_align1.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_optimized_f16/all_sm75_f16_s1688fprop_optimized_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_fixed_channels_f16/cutlass_tensorop_f16_s1688fprop_fixed_channels_f16_128x64_32x2_nhwc_align4.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688dgrad_optimized_f16/cutlass_tensorop_f16_s1688dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_optimized_f16/cutlass_tensorop_f16_s1688fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_objs [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688wgrad_optimized_f16/all_sm75_f16_s1688wgrad_optimized_f16_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs.dir/generated/conv2d/75/h1688dgrad_optimized/all_sm75_h1688dgrad_optimized_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_few_channels_objs.dir/generated/conv2d/75/h1688fprop_few_channels/all_sm75_h1688fprop_few_channels_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688wgrad_optimized_f16/cutlass_tensorop_f16_s1688wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_few_channels_objs.dir/generated/conv2d/75/h1688fprop_few_channels/cutlass_tensorop_h1688fprop_few_channels_128x64_32x2_nhwc_align1.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs.dir/generated/conv2d/75/h1688dgrad_optimized/cutlass_tensorop_h1688dgrad_optimized_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs.dir/generated/conv2d/75/h1688dgrad_optimized/cutlass_tensorop_h1688dgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_objs [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_objs.dir/generated/conv2d/75/h1688fprop_fixed_channels/all_sm75_h1688fprop_fixed_channels_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_objs.dir/generated/conv2d/75/h1688fprop_fixed_channels/cutlass_tensorop_h1688fprop_fixed_channels_128x64_32x2_nhwc_align4.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688fprop_few_channels_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_optimized_objs.dir/generated/conv2d/75/h1688fprop_optimized/all_sm75_h1688fprop_optimized_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_optimized_objs.dir/generated/conv2d/75/h1688fprop_optimized/cutlass_tensorop_h1688fprop_optimized_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688wgrad_optimized_objs.dir/generated/conv2d/75/h1688wgrad_optimized/all_sm75_h1688wgrad_optimized_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688wgrad_optimized_objs.dir/generated/conv2d/75/h1688wgrad_optimized/cutlass_tensorop_h1688wgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/i8816fprop_optimized_s8/all_sm75_i8816fprop_optimized_s8_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/i8816fprop_optimized_u8/all_sm75_i8816fprop_optimized_u8_conv2d_operations.cu.o [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/i8816fprop_optimized_s8/cutlass_tensorop_i8816fprop_optimized_s8_256x128_64x2_nhwc_align16.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688fprop_optimized_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/i8832fprop_optimized_s4/all_sm75_i8832fprop_optimized_s4_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/i8816fprop_optimized_u8/cutlass_tensorop_i8816fprop_optimized_u8_256x128_64x2_nhwc_align16.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688wgrad_optimized_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/i8832fprop_optimized_u4/all_sm75_i8832fprop_optimized_u4_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/i8832fprop_optimized_s4/cutlass_tensorop_i8832fprop_optimized_s4_256x128_128x2_nhwc_align32.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/i8832fprop_optimized_u4/cutlass_tensorop_i8832fprop_optimized_u4_256x128_128x2_nhwc_align32.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688dgrad_optimized_f16/all_sm75_s1688dgrad_optimized_f16_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_objs [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_few_channels_f16/all_sm75_s1688fprop_few_channels_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688dgrad_optimized_f16/cutlass_tensorop_s1688dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_few_channels_f16/cutlass_tensorop_s1688fprop_few_channels_f16_128x64_32x2_nhwc_align1.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_fixed_channels_f16/all_sm75_s1688fprop_fixed_channels_f16_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/s1688fprop_optimized_f16/all_sm75_s1688fprop_optimized_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_fixed_channels_f16/cutlass_tensorop_s1688fprop_fixed_channels_f16_128x64_32x2_nhwc_align4.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/s1688fprop_optimized_f16/cutlass_tensorop_s1688fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688dgrad_optimized_f16/cutlass_tensorop_s1688dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 
56%] Built target cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688wgrad_optimized_f16/all_sm75_s1688wgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/s4_i8832fprop_optimized_s4/all_sm75_s4_i8832fprop_optimized_s4_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688wgrad_optimized_f16/cutlass_tensorop_s1688wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_few_channels_s8/all_sm75_s8_i8816fprop_few_channels_s8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/s4_i8832fprop_optimized_s4/cutlass_tensorop_s4_i8832fprop_optimized_s4_256x128_128x2_nhwc_align32.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_fixed_channels_s8/all_sm75_s8_i8816fprop_fixed_channels_s8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_few_channels_s8/cutlass_tensorop_s8_i8816fprop_few_channels_s8_256x128_64x2_nhwc_align16.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_fixed_channels_s8/cutlass_tensorop_s8_i8816fprop_fixed_channels_s8_256x128_64x2_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_optimized_s8/all_sm75_s8_i8816fprop_optimized_s8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/s4_i8832fprop_optimized_s4/cutlass_tensorop_s4_i8832fprop_optimized_s4_256x128_128x2_nc64hw64_align32.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_optimized_s8/cutlass_tensorop_s8_i8816fprop_optimized_s8_256x128_64x2_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/u4_i8832fprop_optimized_u4/all_sm75_u4_i8832fprop_optimized_u4_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_few_channels_u8/all_sm75_u8_i8816fprop_few_channels_u8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/u4_i8832fprop_optimized_u4/cutlass_tensorop_u4_i8832fprop_optimized_u4_256x128_128x2_nhwc_align32.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_few_channels_u8/cutlass_tensorop_u8_i8816fprop_few_channels_u8_256x128_64x2_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_fixed_channels_u8/all_sm75_u8_i8816fprop_fixed_channels_u8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_optimized_s8/cutlass_tensorop_s8_i8816fprop_optimized_s8_256x128_64x2_nc32hw32_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_fixed_channels_u8/cutlass_tensorop_u8_i8816fprop_fixed_channels_u8_256x128_64x2_nhwc_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/u4_i8832fprop_optimized_u4/cutlass_tensorop_u4_i8832fprop_optimized_u4_256x128_128x2_nc64hw64_align32.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_optimized_u8/all_sm75_u8_i8816fprop_optimized_u8_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816dgrad_optimized_bf16/all_sm80_bf16_s16816dgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_objs [ 56%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_fixed_channels_bf16/all_sm80_bf16_s16816fprop_fixed_channels_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_optimized_u8/cutlass_tensorop_u8_i8816fprop_optimized_u8_256x128_64x2_nhwc_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816dgrad_optimized_bf16/cutlass_tensorop_bf16_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_fixed_channels_bf16/cutlass_tensorop_bf16_s16816fprop_fixed_channels_bf16_256x128_32x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_optimized_bf16/all_sm80_bf16_s16816fprop_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_optimized_bf16/cutlass_tensorop_bf16_s16816fprop_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_optimized_u8/cutlass_tensorop_u8_i8816fprop_optimized_u8_256x128_64x2_nc32hw32_align16.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816dgrad_optimized_bf16/cutlass_tensorop_bf16_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816wgrad_optimized_bf16/all_sm80_bf16_s16816wgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_optimized_bf16/cutlass_tensorop_bf16_s16816fprop_optimized_bf16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816wgrad_optimized_bf16/cutlass_tensorop_bf16_s16816wgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816dgrad_optimized_f16/all_sm80_f16_s16816dgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_fixed_channels_f16/all_sm80_f16_s16816fprop_fixed_channels_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816dgrad_optimized_f16/cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_fixed_channels_f16/cutlass_tensorop_f16_s16816fprop_fixed_channels_f16_256x128_32x3_nhwc_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_optimized_f16/all_sm80_f16_s16816fprop_optimized_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816wgrad_optimized_f16/all_sm80_f16_s16816wgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_optimized_f16/cutlass_tensorop_f16_s16816fprop_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816wgrad_optimized_f16/cutlass_tensorop_f16_s16816wgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816dgrad_optimized_f16/cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs.dir/generated/conv2d/80/h16816dgrad_optimized/all_sm80_h16816dgrad_optimized_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_objs [ 
56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_objs.dir/generated/conv2d/80/h16816fprop_fixed_channels/all_sm80_h16816fprop_fixed_channels_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_optimized_f16/cutlass_tensorop_f16_s16816fprop_optimized_f16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs.dir/generated/conv2d/80/h16816dgrad_optimized/cutlass_tensorop_h16816dgrad_optimized_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_optimized_objs.dir/generated/conv2d/80/h16816fprop_optimized/all_sm80_h16816fprop_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_objs.dir/generated/conv2d/80/h16816fprop_fixed_channels/cutlass_tensorop_h16816fprop_fixed_channels_256x128_32x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_optimized_objs.dir/generated/conv2d/80/h16816fprop_optimized/cutlass_tensorop_h16816fprop_optimized_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs.dir/generated/conv2d/80/h16816dgrad_optimized/cutlass_tensorop_h16816dgrad_optimized_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816wgrad_optimized_objs.dir/generated/conv2d/80/h16816wgrad_optimized/all_sm80_h16816wgrad_optimized_conv2d_operations.cu.o [ 56%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816wgrad_optimized_objs.dir/generated/conv2d/80/h16816wgrad_optimized/cutlass_tensorop_h16816wgrad_optimized_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/i16832fprop_optimized_s8/all_sm80_i16832fprop_optimized_s8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_optimized_objs.dir/generated/conv2d/80/h16816fprop_optimized/cutlass_tensorop_h16816fprop_optimized_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/i16832fprop_optimized_s8/cutlass_tensorop_i16832fprop_optimized_s8_256x128_64x3_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/i16832fprop_optimized_u8/all_sm80_i16832fprop_optimized_u8_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816wgrad_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/i16864fprop_optimized_s4/all_sm80_i16864fprop_optimized_s4_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/i16832fprop_optimized_u8/cutlass_tensorop_i16832fprop_optimized_u8_256x128_64x3_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816fprop_optimized_objs [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/i16864fprop_optimized_u4/all_sm80_i16864fprop_optimized_u4_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/i16864fprop_optimized_s4/cutlass_tensorop_i16864fprop_optimized_s4_256x128_128x3_nhwc_align32.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/i16832fprop_optimized_s8/cutlass_tensorop_i16832fprop_optimized_s8_256x128_64x3_nhwc_single_group_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/i16864fprop_optimized_u4/cutlass_tensorop_i16864fprop_optimized_u4_256x128_128x3_nhwc_align32.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/i16832fprop_optimized_u8/cutlass_tensorop_i16832fprop_optimized_u8_256x128_64x3_nhwc_single_group_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/i16864fprop_optimized_s4/cutlass_tensorop_i16864fprop_optimized_s4_256x128_128x3_nhwc_single_group_align32.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_bf16/all_sm80_s16816dgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/i16864fprop_optimized_u4/cutlass_tensorop_i16864fprop_optimized_u4_256x128_128x3_nhwc_single_group_align32.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_bf16/cutlass_tensorop_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_f16/all_sm80_s16816dgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_bf16/all_sm80_s16816fprop_fixed_channels_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_f16/cutlass_tensorop_s16816dgrad_optimized_f16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_f16/all_sm80_s16816fprop_fixed_channels_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_bf16/cutlass_tensorop_s16816fprop_fixed_channels_bf16_256x128_32x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_bf16/cutlass_tensorop_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_f16/cutlass_tensorop_s16816fprop_fixed_channels_f16_256x128_32x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_f16/cutlass_tensorop_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/s16816fprop_optimized_bf16/all_sm80_s16816fprop_optimized_bf16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/s16816fprop_optimized_f16/all_sm80_s16816fprop_optimized_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_bf16/all_sm80_s16816wgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/s16816fprop_optimized_bf16/cutlass_tensorop_s16816fprop_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/s16816fprop_optimized_f16/cutlass_tensorop_s16816fprop_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_bf16/cutlass_tensorop_s16816wgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_f16/all_sm80_s16816wgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_f16/cutlass_tensorop_s16816wgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/s16816fprop_optimized_bf16/cutlass_tensorop_s16816fprop_optimized_bf16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/s16816fprop_optimized_f16/cutlass_tensorop_s16816fprop_optimized_f16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16dgrad_optimized/all_sm80_s1688bf16dgrad_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16dgrad_optimized/cutlass_tensorop_s1688bf16dgrad_optimized_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_objs [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs.dir/generated/conv2d/80/s1688bf16fprop_optimized/all_sm80_s1688bf16fprop_optimized_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16wgrad_optimized/all_sm80_s1688bf16wgrad_optimized_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs.dir/generated/conv2d/80/s1688dgrad_optimized/all_sm80_s1688dgrad_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs.dir/generated/conv2d/80/s1688bf16fprop_optimized/cutlass_tensorop_s1688bf16fprop_optimized_256x128_16x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16wgrad_optimized/cutlass_tensorop_s1688bf16wgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs.dir/generated/conv2d/80/s1688dgrad_optimized/cutlass_tensorop_s1688dgrad_optimized_128x128_16x4_nhwc_unity_stride_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16dgrad_optimized/cutlass_tensorop_s1688bf16dgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs.dir/generated/conv2d/80/s1688bf16fprop_optimized/cutlass_tensorop_s1688bf16fprop_optimized_256x128_16x3_nhwc_single_group_align4.cu.o [ 56%] Built target 
cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688dgrad_optimized_tf32/all_sm80_s1688dgrad_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs.dir/generated/conv2d/80/s1688dgrad_optimized/cutlass_tensorop_s1688dgrad_optimized_128x128_16x4_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688dgrad_optimized_tf32/cutlass_tensorop_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs.dir/generated/conv2d/80/s1688f16dgrad_optimized/all_sm80_s1688f16dgrad_optimized_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs.dir/generated/conv2d/80/s1688f16fprop_optimized/all_sm80_s1688f16fprop_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs.dir/generated/conv2d/80/s1688f16dgrad_optimized/cutlass_tensorop_s1688f16dgrad_optimized_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs.dir/generated/conv2d/80/s1688f16fprop_optimized/cutlass_tensorop_s1688f16fprop_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs.dir/generated/conv2d/80/s1688f16fprop_optimized/cutlass_tensorop_s1688f16fprop_optimized_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688dgrad_optimized_tf32/cutlass_tensorop_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs.dir/generated/conv2d/80/s1688f16dgrad_optimized/cutlass_tensorop_s1688f16dgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_objs.dir/generated/conv2d/80/s1688f16wgrad_optimized/all_sm80_s1688f16wgrad_optimized_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_objs.dir/generated/conv2d/80/s1688fprop_optimized/all_sm80_s1688fprop_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_objs.dir/generated/conv2d/80/s1688f16wgrad_optimized/cutlass_tensorop_s1688f16wgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/s1688fprop_optimized_tf32/all_sm80_s1688fprop_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_objs.dir/generated/conv2d/80/s1688fprop_optimized/cutlass_tensorop_s1688fprop_optimized_128x128_16x4_nhwc_align4.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/s1688fprop_optimized_tf32/cutlass_tensorop_s1688fprop_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32dgrad_optimized/all_sm80_s1688tf32dgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32dgrad_optimized/cutlass_tensorop_s1688tf32dgrad_optimized_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs.dir/generated/conv2d/80/s1688tf32fprop_optimized/all_sm80_s1688tf32fprop_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_objs.dir/generated/conv2d/80/s1688fprop_optimized/cutlass_tensorop_s1688fprop_optimized_128x128_16x4_nhwc_single_group_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs.dir/generated/conv2d/80/s1688tf32fprop_optimized/cutlass_tensorop_s1688tf32fprop_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/s1688fprop_optimized_tf32/cutlass_tensorop_s1688fprop_optimized_tf32_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_objs [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs.dir/generated/conv2d/80/s1688tf32fprop_optimized/cutlass_tensorop_s1688tf32fprop_optimized_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32dgrad_optimized/cutlass_tensorop_s1688tf32dgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32wgrad_optimized/all_sm80_s1688tf32wgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32wgrad_optimized/cutlass_tensorop_s1688tf32wgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_objs.dir/generated/conv2d/80/s1688wgrad_optimized/all_sm80_s1688wgrad_optimized_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688wgrad_optimized_tf32/all_sm80_s1688wgrad_optimized_tf32_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/all_sm80_s4_i16864fprop_optimized_s4_conv2d_operations.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_objs.dir/generated/conv2d/80/s1688wgrad_optimized/cutlass_tensorop_s1688wgrad_optimized_128x128_16x4_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688wgrad_optimized_tf32/cutlass_tensorop_s1688wgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_few_channels_s8/all_sm80_s8_i16832fprop_few_channels_s8_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/cutlass_tensorop_s4_i16864fprop_optimized_s4_256x128_128x3_nhwc_align32.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_few_channels_s8/cutlass_tensorop_s8_i16832fprop_few_channels_s8_256x128_64x3_nhwc_align16.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_fixed_channels_s8/all_sm80_s8_i16832fprop_fixed_channels_s8_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/all_sm80_s8_i16832fprop_optimized_s8_conv2d_operations.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_fixed_channels_s8/cutlass_tensorop_s8_i16832fprop_fixed_channels_s8_256x128_64x3_nhwc_align16.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/cutlass_tensorop_s4_i16864fprop_optimized_s4_256x128_128x3_nhwc_single_group_align32.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/cutlass_tensorop_s8_i16832fprop_optimized_s8_256x128_64x3_nhwc_align16.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sdgrad_optimized_objs.dir/generated/conv2d/80/sdgrad_optimized/all_sm80_sdgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sdgrad_optimized_objs.dir/generated/conv2d/80/sdgrad_optimized/cutlass_simt_sdgrad_optimized_256x128_8x5_nhwc_unity_stride_align1.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/cutlass_tensorop_s4_i16864fprop_optimized_s4_256x128_128x3_nc64hw64_align32.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sdgrad_optimized_objs.dir/generated/conv2d/80/sdgrad_optimized/cutlass_simt_sdgrad_optimized_256x128_8x5_nhwc_align1.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/cutlass_tensorop_s8_i16832fprop_optimized_s8_256x128_64x3_nhwc_single_group_align16.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sfprop_optimized_objs.dir/generated/conv2d/80/sfprop_optimized/all_sm80_sfprop_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/cutlass_tensorop_s8_i16832fprop_optimized_s8_256x128_64x3_nc32hw32_align16.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sfprop_optimized_objs.dir/generated/conv2d/80/sfprop_optimized/cutlass_simt_sfprop_optimized_256x128_8x5_nhwc_align1.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_swgrad_optimized_objs.dir/generated/conv2d/80/swgrad_optimized/all_sm80_swgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_swgrad_optimized_objs.dir/generated/conv2d/80/swgrad_optimized/cutlass_simt_swgrad_optimized_256x128_8x5_nhwc_align1.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_sdgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688dgrad_optimized_tf32/all_sm80_tf32_s1688dgrad_optimized_tf32_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688fprop_optimized_tf32/all_sm80_tf32_s1688fprop_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688dgrad_optimized_tf32/cutlass_tensorop_tf32_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688fprop_optimized_tf32/cutlass_tensorop_tf32_s1688fprop_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_sfprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688wgrad_optimized_tf32/all_sm80_tf32_s1688wgrad_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688wgrad_optimized_tf32/cutlass_tensorop_tf32_s1688wgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_swgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/all_sm80_u4_i16864fprop_optimized_u4_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688dgrad_optimized_tf32/cutlass_tensorop_tf32_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688fprop_optimized_tf32/cutlass_tensorop_tf32_s1688fprop_optimized_tf32_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/cutlass_tensorop_u4_i16864fprop_optimized_u4_256x128_128x3_nhwc_align32.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_few_channels_u8/all_sm80_u8_i16832fprop_few_channels_u8_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/cutlass_tensorop_u4_i16864fprop_optimized_u4_256x128_128x3_nhwc_single_group_align32.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_few_channels_u8/cutlass_tensorop_u8_i16832fprop_few_channels_u8_256x128_64x3_nhwc_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_fixed_channels_u8/all_sm80_u8_i16832fprop_fixed_channels_u8_conv2d_operations.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/all_sm80_u8_i16832fprop_optimized_u8_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_fixed_channels_u8/cutlass_tensorop_u8_i16832fprop_fixed_channels_u8_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/cutlass_tensorop_u8_i16832fprop_optimized_u8_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/cutlass_tensorop_u4_i16864fprop_optimized_u4_256x128_128x3_nc64hw64_align32.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e4m3/all_sm89_s16832fprop_fixed_channels_e4m3_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e4m3/cutlass_tensorop_s16832fprop_fixed_channels_e4m3_256x128_64x3_nhwc_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e5m2/all_sm89_s16832fprop_fixed_channels_e5m2_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/cutlass_tensorop_u8_i16832fprop_optimized_u8_256x128_64x3_nhwc_single_group_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs.dir/generated/conv2d/89/s16832fprop_optimized_e4m3/all_sm89_s16832fprop_optimized_e4m3_conv2d_operations.cu.o [ 58%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e5m2/cutlass_tensorop_s16832fprop_fixed_channels_e5m2_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs.dir/generated/conv2d/89/s16832fprop_optimized_e4m3/cutlass_tensorop_s16832fprop_optimized_e4m3_256x128_64x3_nhwc_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs.dir/generated/conv2d/89/s16832fprop_optimized_e5m2/all_sm89_s16832fprop_optimized_e5m2_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/cutlass_tensorop_u8_i16832fprop_optimized_u8_256x128_64x3_nc32hw32_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16/all_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs.dir/generated/conv2d/89/s16832fprop_optimized_e5m2/cutlass_tensorop_s16832fprop_optimized_e5m2_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs.dir/generated/conv2d/89/s16832fprop_optimized_e4m3/cutlass_tensorop_s16832fprop_optimized_e4m3_256x128_64x3_nhwc_single_group_align16.cu.o [ 58%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16/all_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_conv2d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs.dir/generated/conv2d/89/s16832fprop_optimized_e5m2/cutlass_tensorop_s16832fprop_optimized_e5m2_256x128_64x3_nhwc_single_group_align16.cu.o [ 59%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_conv2d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_conv2d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, 
void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, 
cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 
'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, 
void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required 
from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] 
Built target cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32/all_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_conv2d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, 
cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, 
cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, 
cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, 
cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, 
cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, 
cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, 
cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, 
cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32/all_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_conv2d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, 
cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct 
cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_analytic_bf16/all_sm80_bf16_s16816dgrad3d_analytic_bf16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_analytic_bf16/cutlass_tensorop_bf16_s16816dgrad3d_analytic_bf16_256x128_32x3.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 
0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_optimized_bf16/all_sm80_bf16_s16816dgrad3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_optimized_bf16/cutlass_tensorop_bf16_s16816dgrad3d_optimized_bf16_256x128_32x3_unity_stride.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816fprop3d_optimized_bf16/all_sm80_bf16_s16816fprop3d_optimized_bf16_conv3d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > 
> >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has 
no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816wgrad3d_optimized_bf16/all_sm80_bf16_s16816wgrad3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816fprop3d_optimized_bf16/cutlass_tensorop_bf16_s16816fprop3d_optimized_bf16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816wgrad3d_optimized_bf16/cutlass_tensorop_bf16_s16816wgrad3d_optimized_bf16_256x128_32x3.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, 
void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, 
cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > 
>, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_analytic_f16/all_sm80_f16_s16816dgrad3d_analytic_f16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_optimized_f16/all_sm80_f16_s16816dgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_analytic_f16/cutlass_tensorop_f16_s16816dgrad3d_analytic_f16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_optimized_f16/cutlass_tensorop_f16_s16816dgrad3d_optimized_f16_256x128_32x3_unity_stride.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; 
SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816fprop3d_optimized_f16/all_sm80_f16_s16816fprop3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816fprop3d_optimized_f16/cutlass_tensorop_f16_s16816fprop3d_optimized_f16_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816wgrad3d_optimized_f16/all_sm80_f16_s16816wgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_objs.dir/generated/conv3d/80/h16816dgrad3d_analytic/all_sm80_h16816dgrad3d_analytic_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816wgrad3d_optimized_f16/cutlass_tensorop_f16_s16816wgrad3d_optimized_f16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_objs.dir/generated/conv3d/80/h16816dgrad3d_analytic/cutlass_tensorop_h16816dgrad3d_analytic_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_objs [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_objs.dir/generated/conv3d/80/h16816dgrad3d_optimized/all_sm80_h16816dgrad3d_optimized_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_objs.dir/generated/conv3d/80/h16816dgrad3d_optimized/cutlass_tensorop_h16816dgrad3d_optimized_256x128_32x3_unity_stride.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816fprop3d_optimized_objs.dir/generated/conv3d/80/h16816fprop3d_optimized/all_sm80_h16816fprop3d_optimized_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_objs.dir/generated/conv3d/80/h16816wgrad3d_optimized/all_sm80_h16816wgrad3d_optimized_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816fprop3d_optimized_objs.dir/generated/conv3d/80/h16816fprop3d_optimized/cutlass_tensorop_h16816fprop3d_optimized_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_objs.dir/generated/conv3d/80/h16816wgrad3d_optimized/cutlass_tensorop_h16816wgrad3d_optimized_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_bf16/all_sm80_s16816dgrad3d_analytic_bf16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_bf16/cutlass_tensorop_s16816dgrad3d_analytic_bf16_256x128_32x3.cu.o [ 59%] Built target 
cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_f16/all_sm80_s16816dgrad3d_analytic_f16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_h16816fprop3d_optimized_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_bf16/all_sm80_s16816dgrad3d_optimized_bf16_conv3d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, 
cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_f16/all_sm80_s16816dgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_f16/cutlass_tensorop_s16816dgrad3d_analytic_f16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_bf16/cutlass_tensorop_s16816dgrad3d_optimized_bf16_256x128_32x3_unity_stride.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_f16/cutlass_tensorop_s16816dgrad3d_optimized_f16_256x128_32x3_unity_stride.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_bf16/all_sm80_s16816fprop3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_bf16/cutlass_tensorop_s16816fprop3d_optimized_bf16_256x128_32x3.cu.o [ 59%] Built target 
cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_f16/all_sm80_s16816fprop3d_optimized_f16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_bf16/all_sm80_s16816wgrad3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_f16/all_sm80_s16816wgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_f16/cutlass_tensorop_s16816fprop3d_optimized_f16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_bf16/cutlass_tensorop_s16816wgrad3d_optimized_bf16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_f16/cutlass_tensorop_s16816wgrad3d_optimized_f16_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16/all_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_conv3d_operations.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16/all_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > 
>, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, 
cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool 
ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, 
cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, 
cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> 
>, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, 
cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32/all_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_conv3d_operations.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = 
cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status 
cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided 
default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, 
cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, 
cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, 
cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, 
cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, 
cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, 
cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32/all_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, 
ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, 
cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, 
cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 
2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status 
cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque 
[16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, 
cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/all_sm80_c1688herk_rank_k_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_n_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_n_u_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, 
cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, 
cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, 
cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_h_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_h_u_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/all_sm80_c1688syrk_rank_k_operations.cu.o [ 59%] Built target cutlass_library_rank_k_sm80_c1688herk_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/all_sm80_c1688tf32herk_rank_k_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_n_l_align1.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': 
/builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const 
Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = 
cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target 
cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_n_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_n_u_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_h_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_n_u_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_h_u_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_t_l_align1.cu.o [ 59%] Built target cutlass_library_rank_k_sm80_c1688tf32herk_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/all_sm80_c1688tf32syrk_rank_k_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_t_u_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_n_l_align1.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_n_u_align1.cu.o [ 59%] Built target cutlass_library_rank_k_sm80_c1688syrk_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/all_sm80_d884syrk_rank_k_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, 
cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/all_sm80_gz884herk_rank_k_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_t_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_n_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_h_l_align1.cu.o [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_n_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_c1688tf32syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/all_sm80_gz884syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_t_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_gz884herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/all_sm80_s1688syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_n_l_align1.cu.o [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_t_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_gz884syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/all_sm80_s1688tf32syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_d884syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/all_sm80_z884herk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_n_u_align1.cu.o [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_t_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_s1688syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/all_sm80_z884syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_n_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_z884herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/all_sm90_d1684syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_t_u_align1.cu.o [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_t_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_z884syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/all_sm90_gz1684herk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_h_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_s1688tf32syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/all_sm90_gz1684syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] 
Built target cutlass_library_rank_k_sm90_d1684syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/all_sm90_z1684herk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_gz1684herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/all_sm90_z1684syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_t_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_z1684herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/all_sm80_c1688her2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_gz1684syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/all_sm80_c1688syr2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_n_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_z1684syrk_objs [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/all_sm80_c1688tf32her2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_h_l_align1.cu.o [ 60%] Built target cutlass_library_rank_2k_sm80_c1688syr2k_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/all_sm80_c1688tf32syr2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_n_l_align1.cu.o [ 60%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_h_u_align1.cu.o [ 60%] Built target cutlass_library_rank_2k_sm80_c1688tf32her2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/all_sm80_d884syr2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_n_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_t_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_n_u_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/all_sm80_gz884her2k_rank_2k_operations.cu.o [ 61%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_t_l_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_c1688her2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/all_sm80_gz884syr2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_t_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_n_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_n_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_n_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_t_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_n_u_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_d884syr2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/all_sm80_s1688syr2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_n_l_align1.cu.o [ 61%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_t_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_n_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_h_l_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_gz884syr2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/all_sm80_s1688tf32syr2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_n_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_h_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_t_l_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_gz884her2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/all_sm80_z884her2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_t_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_n_l_align1.cu.o [ 62%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_h_l_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm80_s1688syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/all_sm80_z884syr2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_h_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_t_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_t_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_n_u_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm80_z884her2k_objs [ 62%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/all_sm90_d1684syr2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_t_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_t_u_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/all_sm90_gz1684her2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_n_u_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm80_z884syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/all_sm90_gz1684syr2k_rank_2k_operations.cu.o [ 62%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_h_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_h_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_t_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_t_l_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm90_gz1684her2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/all_sm90_z1684her2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_t_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_t_u_align1.cu.o [ 62%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_n_u_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm90_gz1684syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/all_sm90_z1684syr2k_rank_2k_operations.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_h_l_align1.cu.o [ 63%] Built target cutlass_library_rank_2k_sm90_d1684syr2k_objs [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/all_sm80_c1688tf32trmm_trmm_operations.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_n_l_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_h_u_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_n_u_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_l_nu_align1.cu.o [ 63%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_l_un_align1.cu.o [ 63%] Built target cutlass_library_rank_2k_sm90_z1684her2k_objs [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/all_sm80_c1688trmm_trmm_operations.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_u_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_t_l_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_u_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_u_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_t_u_align1.cu.o [ 63%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_u_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_l_un_align1.cu.o [ 63%] Built target cutlass_library_rank_2k_sm90_z1684syr2k_objs [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_u_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_u_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_u_nu_align1.cu.o [ 63%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_u_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_u_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_u_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_u_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_l_nu_align1.cu.o [ 64%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_u_un_align1.cu.o [ 64%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_u_nu_align1.cu.o [ 65%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_l_nu_align1.cu.o [ 65%] Built target cutlass_library_trmm_sm80_c1688tf32trmm_objs [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/all_sm80_d884trmm_trmm_operations.cu.o [ 65%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_u_un_align1.cu.o [ 65%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_l_nu_align1.cu.o [ 65%] Built target cutlass_library_trmm_sm80_c1688trmm_objs [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/all_sm80_gz884trmm_trmm_operations.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_l_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_u_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_u_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_l_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_l_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_u_nu_align1.cu.o [ 66%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_u_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_l_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_u_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_u_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_u_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_u_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_u_un_align1.cu.o [ 66%] Built target cutlass_library_trmm_sm80_d884trmm_objs [ 66%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/all_sm80_s1688tf32trmm_trmm_operations.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_u_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_l_nu_align1.cu.o [ 67%] Built target cutlass_library_trmm_sm80_gz884trmm_objs [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/all_sm80_s1688trmm_trmm_operations.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_u_un_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/all_sm80_z884trmm_trmm_operations.cu.o [ 67%] Built target cutlass_library_trmm_sm80_s1688tf32trmm_objs [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/all_sm90_d1684trmm_trmm_operations.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_l_un_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_u_un_align1.cu.o [ 68%] Built target cutlass_library_trmm_sm80_s1688trmm_objs [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/all_sm90_gz1684trmm_trmm_operations.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_l_un_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_l_un_align1.cu.o [ 69%] Built target cutlass_library_trmm_sm90_d1684trmm_objs [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/all_sm90_z1684trmm_trmm_operations.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_u_un_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_l_un_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_u_un_align1.cu.o [ 69%] Built target cutlass_library_trmm_sm80_z884trmm_objs [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/all_sm80_c1688hemm_symm_operations.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_ls_l_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_u_nu_align1.cu.o [ 69%] Built target cutlass_library_trmm_sm90_gz1684trmm_objs [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/all_sm80_c1688symm_symm_operations.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_ls_u_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_ls_l_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_rs_l_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_ls_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_l_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_l_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_l_un_align1.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_u_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_u_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_u_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_u_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_l_nu_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_c1688hemm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/all_sm80_c1688tf32hemm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_ls_l_align1.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_l_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_l_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_l_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_u_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_ls_u_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_c1688symm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/all_sm80_c1688tf32symm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_ls_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_u_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_u_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_rs_l_align1.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_u_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_ls_u_align1.cu.o [ 70%] Built target cutlass_library_trmm_sm90_z1684trmm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/all_sm80_d884symm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_ls_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_ls_u_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_c1688tf32hemm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/all_sm80_gz884hemm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_ls_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_ls_u_align1.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/all_sm80_gz884symm_symm_operations.cu.o [ 70%] Built target cutlass_library_symm_sm80_c1688tf32symm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/all_sm80_s1688symm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_ls_l_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_gz884hemm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/all_sm80_s1688tf32symm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_ls_l_align1.cu.o [ 71%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_ls_l_align1.cu.o [ 71%] Built target cutlass_library_symm_sm80_d884symm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/all_sm80_z884hemm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_ls_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_ls_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_ls_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_ls_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_ls_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_rs_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_rs_l_align1.cu.o [ 71%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_rs_l_align1.cu.o [ 71%] Built target cutlass_library_symm_sm80_gz884symm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/all_sm80_z884symm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_ls_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_rs_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_rs_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_rs_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_ls_u_align1.cu.o [ 71%] Built target cutlass_library_symm_sm80_z884hemm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/all_sm90_d1684symm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_ls_l_align1.cu.o [ 71%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_rs_l_align1.cu.o [ 71%] Built target cutlass_library_symm_sm80_s1688tf32symm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/all_sm90_gz1684hemm_symm_operations.cu.o [ 71%] Built target cutlass_library_symm_sm80_s1688symm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/all_sm90_gz1684symm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_ls_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_rs_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_rs_l_align1.cu.o [ 72%] Built target 
cutlass_library_symm_sm80_z884symm_objs [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/all_sm90_z1684hemm_symm_operations.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_rs_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 72%] Built target cutlass_library_symm_sm90_d1684symm_objs [ 72%] Linking CUDA static library libcutlass_symm_sm90_z1684symm.a [ 72%] Built target cutlass_library_symm_sm90_z1684symm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm50_cgemm.a [ 72%] Built target cutlass_library_gemm_sm50_cgemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm50_dgemm.a [ 72%] Built target cutlass_library_gemm_sm50_dgemm_static [ 72%] Linking CUDA 
static library libcutlass_gemm_sm50_sgemm.a [ 72%] Built target cutlass_library_gemm_sm50_sgemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm60_hgemm.a [ 72%] Built target cutlass_library_gemm_sm60_hgemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm61_igemm_s8.a [ 72%] Built target cutlass_library_gemm_sm61_igemm_s8_static [ 72%] Linking CUDA static library libcutlass_gemm_sm61_s8_igemm_s8.a [ 72%] Built target cutlass_library_gemm_sm61_s8_igemm_s8_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_f16_s884gemm_f16.a [ 72%] Built target cutlass_library_gemm_sm70_f16_s884gemm_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.a [ 72%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_h884gemm.a [ 72%] Built target cutlass_library_gemm_sm70_h884gemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_h884gemm_planar_complex.a [ 72%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_h884gemm_planar_complex_array.a [ 72%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_array_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_s884gemm_f16.a [ 72%] Built target cutlass_library_gemm_sm70_s884gemm_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.a [ 72%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_s884gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_static [ 72%] Linking CUDA static library 
libcutlass_gemm_sm75_f16_s1688gemm_f16.a [ 72%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.a [ 72%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_h1688gemm.a [ 72%] Built target cutlass_library_gemm_sm75_h1688gemm_static [ 72%] Built target cutlass_library_symm_sm90_gz1684symm_objs [ 72%] Linking CUDA static library libcutlass_gemm_sm75_h1688gemm_planar_complex.a [ 72%] Linking CUDA static library libcutlass_gemm_sm75_h1688gemm_planar_complex_array.a [ 72%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_static [ 72%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i88128xorgemm_b1.a [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8816gemm_s8.a [ 72%] Built target cutlass_library_gemm_sm75_i88128xorgemm_b1_static [ 72%] Built target cutlass_library_gemm_sm75_i8816gemm_s8_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8816gemm_u8.a [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8832gemm_s4.a [ 72%] Built target cutlass_library_gemm_sm75_i8816gemm_u8_static [ 72%] Built target cutlass_library_gemm_sm75_i8832gemm_s4_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8832gemm_u4.a [ 72%] Linking CUDA static library libcutlass_gemm_sm75_s1688gemm_f16.a [ 72%] Built target cutlass_library_symm_sm90_gz1684hemm_objs [ 72%] Built target cutlass_library_gemm_sm75_i8832gemm_u4_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.a [ 72%] Linking CUDA static library 
libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm75_s1688gemm_f16_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_s4_i8832gemm_s4.a [ 73%] Built target cutlass_library_gemm_sm75_s4_i8832gemm_s4_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_s8_i8816gemm_s8.a [ 73%] Built target cutlass_library_gemm_sm75_s8_i8816gemm_s8_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_u4_i8832gemm_u4.a [ 73%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_static [ 73%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_u8_i8816gemm_u8.a [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_bf16.a [ 73%] Built target cutlass_library_gemm_sm75_u4_i8832gemm_u4_static [ 73%] Built target cutlass_library_gemm_sm75_u8_i8816gemm_u8_static [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.a [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.a [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_static [ 73%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_static [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_static [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.a [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.a [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_static [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.a [ 74%] Linking 
CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.a [ 74%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_static [ 74%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.a [ 74%] Linking CUDA static library libcutlass_gemm_sm80_c1688gemm.a [ 74%] Built target cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_c1688tf32gemm.a [ 74%] Built target cutlass_library_gemm_sm80_c1688gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_cgemm.a [ 74%] Built target cutlass_library_gemm_sm80_c1688tf32gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_d884gemm.a [ 74%] Built target cutlass_library_gemm_sm80_d884gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_dgemm.a [ 74%] Built target cutlass_library_gemm_sm80_dgemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_static [ 74%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 74%] Built target cutlass_library_gemm_sm80_cgemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.a [ 74%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.a [ 74%] Built target 
cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16832spgemm_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16832spgemm_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.a [ 74%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_gz884gemm.a [ 74%] Built target cutlass_library_gemm_sm80_gz884gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm_grouped.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_grouped_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm_planar_complex.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm_planar_complex_array.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm_s8_f16.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_s8_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16832spgemm.a [ 74%] Built target cutlass_library_gemm_sm80_h16832spgemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i168128spgemm_s4.a [ 74%] 
Built target cutlass_library_gemm_sm80_i168128spgemm_s4_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i168256andgemm_b1.a [ 74%] Built target cutlass_library_gemm_sm80_i168256andgemm_b1_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i168256xorgemm_b1.a [ 74%] Built target cutlass_library_gemm_sm80_i168256xorgemm_b1_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16832gemm_s8.a [ 74%] Built target cutlass_library_gemm_sm80_i16832gemm_s8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16832gemm_u8.a [ 74%] Built target cutlass_library_gemm_sm80_i16832gemm_u8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16864gemm_s4.a [ 74%] Built target cutlass_library_gemm_sm80_i16864gemm_s4_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16864gemm_u4.a [ 74%] Built target cutlass_library_gemm_sm80_i16864gemm_u4_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16864spgemm_s8.a [ 74%] Built target cutlass_library_gemm_sm80_i16864spgemm_s8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_bf16.a [ 74%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_bf16_s8.a [ 74%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_s8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_bf16_u8.a [ 74%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_u8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_f16.a [ 74%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_f16_s8.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_s8_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_f16_u8.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_u8_static [ 75%] Linking CUDA static library 
libcutlass_gemm_sm80_s16816gemm_grouped_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_grouped_f16.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_static [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_s8_bf16.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_s8_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_bf16_static [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_u8_f16.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_u8_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_bf16_static [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816tf32spgemm.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16832spgemm_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816tf32spgemm_static [ 75%] Built target cutlass_library_gemm_sm80_s16832spgemm_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16832spgemm_f16.a [ 75%] Linking CUDA static library 
libcutlass_gemm_sm80_s1688bf16gemm.a [ 75%] Built target cutlass_library_gemm_sm80_s16832spgemm_f16_static [ 75%] Built target cutlass_library_gemm_sm80_s1688bf16gemm_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688gemm.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688f16gemm.a [ 75%] Built target cutlass_library_gemm_sm80_s1688gemm_static [ 75%] Built target cutlass_library_gemm_sm80_s1688f16gemm_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688tf32gemm.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688gemm_tf32.a [ 75%] Built target cutlass_library_gemm_sm80_s1688gemm_tf32_static [ 75%] Built target cutlass_library_gemm_sm80_s1688tf32gemm_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s4_i168128spgemm_s4.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s4_i16864gemm_s4.a [ 75%] Built target cutlass_library_gemm_sm80_s4_i168128spgemm_s4_static [ 75%] Built target cutlass_library_gemm_sm80_s4_i16864gemm_s4_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_s8_i16832gemm_s8.a [ 76%] Linking CUDA static library libcutlass_gemm_sm80_s8_i16864spgemm_s8.a [ 76%] Built target cutlass_library_gemm_sm80_s8_i16832gemm_s8_static [ 76%] Built target cutlass_library_gemm_sm80_s8_i16864spgemm_s8_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_sgemm.a [ 76%] Linking CUDA static library libcutlass_gemm_sm80_tf32_s1688gemm_tf32.a [ 76%] Built target cutlass_library_gemm_sm80_sgemm_static [ 76%] Built target cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_u4_i16864gemm_u4.a [ 76%] Linking CUDA static library libcutlass_gemm_sm80_u8_i16832gemm_u8.a [ 76%] Built target cutlass_library_gemm_sm80_u4_i16864gemm_u4_static [ 76%] Built target cutlass_library_gemm_sm80_u8_i16832gemm_u8_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_z884gemm.a [ 76%] Linking CUDA static library 
libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_static [ 76%] Linking CUDA static library libcutlass_conv2d_sm80_s1688f16fprop_optimized.a [ 76%] Built target cutlass_library_gemm_sm80_z884gemm_static [ 76%] Built target cutlass_library_conv2d_sm80_s1688f16fprop_optimized_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.a [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832gemm_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_static [ 76%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.a [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832gemm_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_static [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.a [ 76%] Linking CUDA static library libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.a [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_static [ 76%] Built target cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.a [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_static [ 76%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.a [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e4m3.a [ 76%] Built target 
cutlass_library_gemm_sm89_s16864spgemm_e4m3_static [ 76%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.a [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_static [ 76%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_static [ 77%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.a [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.a [ 77%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_d1684gemm.a [ 77%] Built target cutlass_library_gemm_sm90_d1684gemm_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_static [ 78%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_static [ 78%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_static [ 78%] Linking CUDA static library 
libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_static [ 78%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_gz1684gemm.a [ 79%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_h64x128x16gemm.a [ 79%] Built target cutlass_library_gemm_sm90_gz1684gemm_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_i64x128x32gemm_s8.a [ 79%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_s8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_i64x128x32gemm_u8.a [ 79%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_u8_static [ 79%] Built target cutlass_library_gemm_sm90_h64x128x16gemm_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x16gemm_bf16.a [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x16gemm_f16.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16_static [ 79%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_f16_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e4m3.a [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e5m2.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x8gemm_tf32.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_static [ 79%] Linking CUDA static library 
libcutlass_gemm_sm90_s64x128x8tf32gemm.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x8gemm_tf32_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.a [ 79%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_static [ 79%] Built target cutlass_library_gemm_sm90_s64x128x8tf32gemm_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.a [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_i64x128x32gemm_s8.a [ 79%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_static [ 79%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_i64x128x32gemm_u8.a [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.a [ 79%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x16gemm_f16.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_static [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.a [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_static [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.a [ 79%] Linking CUDA static library libcutlass_gemm_sm90_z1684gemm.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.a [ 79%] Built target 
cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_static [ 79%] Built target cutlass_library_gemm_sm90_z1684gemm_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_sdgrad_optimized.a [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_static [ 79%] Built target cutlass_library_conv2d_sm50_sdgrad_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_sfprop_optimized.a [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_swgrad_optimized.a [ 79%] Built target cutlass_library_conv2d_sm50_sfprop_optimized_static [ 79%] Built target cutlass_library_conv2d_sm50_swgrad_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm60_hfprop_optimized.a [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm60_hfprop_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_h884dgrad_optimized.a [ 79%] Built target cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_h884fprop_optimized.a [ 79%] Built target cutlass_library_conv2d_sm70_h884dgrad_optimized_static [ 79%] Built target cutlass_library_conv2d_sm70_h884fprop_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_h884wgrad_optimized.a [ 79%] Linking CUDA static library 
libcutlass_conv2d_sm70_s884dgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_h884wgrad_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_s884fprop_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_s884wgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_s884fprop_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_static [ 79%] Built target cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.a [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_static [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.a [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_static [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.a [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_static [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688dgrad_optimized.a [ 
80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688fprop_few_channels.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688fprop_few_channels_static [ 80%] Built target cutlass_library_conv2d_sm75_h1688dgrad_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688fprop_fixed_channels.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688fprop_optimized.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_static [ 80%] Built target cutlass_library_conv2d_sm75_h1688fprop_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688wgrad_optimized.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8816fprop_optimized_s8.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688wgrad_optimized_static [ 80%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8816fprop_optimized_u8.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8832fprop_optimized_s4.a [ 80%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_static [ 80%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8832fprop_optimized_u4.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688fprop_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_static [ 80%] Linking CUDA static 
library libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.a [ 80%] Built target cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_static [ 80%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.a [ 80%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.a [ 80%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.a [ 80%] Built target cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_static [ 80%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.a [ 80%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.a [ 80%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_static [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_static [ 80%] Linking CUDA static 
library libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_static [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_static [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_static [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816fprop_fixed_channels.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816dgrad_optimized.a [ 80%] Built target cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_static [ 80%] Built target cutlass_library_conv2d_sm80_h16816dgrad_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816fprop_optimized.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816wgrad_optimized.a [ 80%] Built target cutlass_library_conv2d_sm80_h16816wgrad_optimized_static [ 80%] Built target cutlass_library_conv2d_sm80_h16816fprop_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16832fprop_optimized_s8.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16832fprop_optimized_u8.a [ 80%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_static [ 80%] Built target 
cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16864fprop_optimized_s4.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16864fprop_optimized_u4.a [ 80%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_static [ 80%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.a [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_static [ 80%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.a [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_static [ 81%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.a [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816fprop_optimized_f16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_static [ 81%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.a [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_static [ 81%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.a [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688bf16fprop_optimized.a [ 81%] Built target 
cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_static [ 81%] Built target cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.a [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688dgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_static [ 81%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.a [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688f16dgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_static [ 81%] Built target cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688f16wgrad_optimized.a [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688fprop_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_static [ 81%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.a [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_static [ 81%] Built target cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688tf32fprop_optimized.a [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688wgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_static [ 81%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.a [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.a [ 81%] Built target 
cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.a [ 81%] Built target cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.a [ 81%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.a [ 82%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_sdgrad_optimized.a [ 82%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_sfprop_optimized.a [ 82%] Built target cutlass_library_conv2d_sm80_sdgrad_optimized_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_swgrad_optimized.a [ 82%] Built target cutlass_library_conv2d_sm80_sfprop_optimized_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.a [ 82%] Built target cutlass_library_conv2d_sm80_swgrad_optimized_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.a [ 82%] Built target cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.a [ 82%] Built target cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.a [ 82%] Built target cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.a [ 82%] Built target cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_static [ 82%] Linking CUDA static library 
libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.a [ 82%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.a [ 82%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.a [ 82%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_static [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.a [ 82%] Linking CUDA static library libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.a [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_static [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.a [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.a [ 82%] Built target cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_static [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.a [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_static [ 82%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.a [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.a [ 82%] Built target 
cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_static [ 82%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.a [ 82%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_static [ 82%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_static [ 82%] Built target cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_static [ 82%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_static [ 82%] Built target cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_h16816dgrad3d_analytic.a [ 82%] Built target cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_static [ 82%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_static [ 82%] Linking 
CUDA static library libcutlass_conv3d_sm80_h16816dgrad3d_optimized.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_h16816fprop3d_optimized.a [ 82%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_static [ 82%] Built target cutlass_library_conv3d_sm80_h16816fprop3d_optimized_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_h16816wgrad3d_optimized.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_static [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_static [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_static [ 82%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_static [ 82%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.a [ 82%] Built target cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_static [ 82%] Built target 
cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_static [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.a [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_c1688herk.a [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_static [ 82%] Built target cutlass_library_rank_k_sm80_c1688herk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_c1688syrk.a [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_c1688tf32herk.a [ 82%] Built target cutlass_library_rank_k_sm80_c1688syrk_static [ 82%] Built target cutlass_library_rank_k_sm80_c1688tf32herk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_c1688tf32syrk.a [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_d884syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_c1688tf32syrk_static [ 82%] Built target cutlass_library_rank_k_sm80_d884syrk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_gz884herk.a [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_gz884syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_gz884herk_static [ 82%] 
Linking CUDA static library libcutlass_rank_k_sm80_s1688syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_gz884syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm80_s1688tf32syrk.a [ 83%] Built target cutlass_library_rank_k_sm80_s1688syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm80_z884herk.a [ 83%] Built target cutlass_library_rank_k_sm80_s1688tf32syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm80_z884syrk.a [ 83%] Built target cutlass_library_rank_k_sm80_z884herk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_d1684syrk.a [ 83%] Built target cutlass_library_rank_k_sm80_z884syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_gz1684herk.a [ 83%] Built target cutlass_library_rank_k_sm90_d1684syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_gz1684syrk.a [ 83%] Built target cutlass_library_rank_k_sm90_gz1684herk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_z1684herk.a [ 83%] Built target cutlass_library_rank_k_sm90_gz1684syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_z1684syrk.a [ 83%] Built target cutlass_library_rank_k_sm90_z1684herk_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688her2k.a [ 83%] Built target cutlass_library_rank_k_sm90_z1684syrk_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_c1688her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688tf32her2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_c1688syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688tf32syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_c1688tf32her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_d884syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_c1688tf32syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_gz884her2k.a [ 
83%] Built target cutlass_library_rank_2k_sm80_d884syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_gz884syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_gz884her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_s1688syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_gz884syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_s1688tf32syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_s1688syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_z884her2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_s1688tf32syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_z884syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_z884her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_d1684syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_z884syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_gz1684her2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_d1684syr2k_static [ 83%] Built target cutlass_library_rank_2k_sm90_gz1684her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_gz1684syr2k.a [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_z1684her2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_gz1684syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_z1684syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_z1684her2k_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_c1688tf32trmm.a [ 83%] Built target cutlass_library_rank_2k_sm90_z1684syr2k_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_c1688trmm.a [ 83%] Built target cutlass_library_trmm_sm80_c1688tf32trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_d884trmm.a [ 83%] Built target cutlass_library_trmm_sm80_c1688trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_gz884trmm.a [ 83%] Built target cutlass_library_trmm_sm80_d884trmm_static [ 
83%] Linking CUDA static library libcutlass_trmm_sm80_s1688tf32trmm.a [ 83%] Built target cutlass_library_trmm_sm80_gz884trmm_static [ 83%] Built target cutlass_library_trmm_sm80_s1688tf32trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_s1688trmm.a [ 83%] Linking CUDA static library libcutlass_trmm_sm80_z884trmm.a [ 83%] Built target cutlass_library_trmm_sm80_s1688trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm90_d1684trmm.a [ 83%] Built target cutlass_library_trmm_sm80_z884trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm90_gz1684trmm.a [ 83%] Built target cutlass_library_trmm_sm90_d1684trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm90_z1684trmm.a [ 83%] Built target cutlass_library_trmm_sm90_gz1684trmm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688hemm.a [ 83%] Built target cutlass_library_trmm_sm90_z1684trmm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688symm.a [ 83%] Built target cutlass_library_symm_sm80_c1688hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688tf32hemm.a [ 83%] Built target cutlass_library_symm_sm80_c1688symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688tf32symm.a [ 83%] Built target cutlass_library_symm_sm80_c1688tf32hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_d884symm.a [ 83%] Built target cutlass_library_symm_sm80_c1688tf32symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_gz884hemm.a [ 83%] Built target cutlass_library_symm_sm80_d884symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_gz884symm.a [ 83%] Built target cutlass_library_symm_sm80_gz884hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_s1688symm.a [ 83%] Built target cutlass_library_symm_sm80_gz884symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_s1688tf32symm.a [ 83%] Built target cutlass_library_symm_sm80_s1688symm_static [ 83%] 
Linking CUDA static library libcutlass_symm_sm80_z884hemm.a [ 83%] Built target cutlass_library_symm_sm80_s1688tf32symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_z884symm.a [ 83%] Built target cutlass_library_symm_sm80_z884hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm90_d1684symm.a [ 83%] Built target cutlass_library_symm_sm80_z884symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm90_gz1684hemm.a [ 83%] Built target cutlass_library_symm_sm90_d1684symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm90_gz1684symm.a [ 83%] Built target cutlass_library_symm_sm90_gz1684hemm_static [ 83%] Linking CUDA shared library libcutlass_symm_sm90_z1684symm.so [ 83%] Built target cutlass_library_symm_sm90_gz1684symm_static [ 83%] Linking CUDA shared library libcutlass_gemm_sm50_cgemm.so [ 83%] Built target cutlass_library_symm_sm90_z1684symm [ 83%] Built target cutlass_library_gemm_sm50_cgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm50_dgemm.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm50_sgemm.so [ 83%] Built target cutlass_library_gemm_sm50_dgemm [ 83%] Built target cutlass_library_gemm_sm50_sgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm60_hgemm.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm61_igemm_s8.so [ 83%] Built target cutlass_library_gemm_sm61_igemm_s8 [ 83%] Built target cutlass_library_gemm_sm60_hgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm61_s8_igemm_s8.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_f16_s884gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm61_s8_igemm_s8 [ 83%] Built target cutlass_library_gemm_sm70_f16_s884gemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16 [ 83%] Built target 
cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_h884gemm.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_h884gemm_planar_complex.so [ 83%] Built target cutlass_library_gemm_sm70_h884gemm [ 83%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_h884gemm_planar_complex_array.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_s884gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm70_s884gemm_f16 [ 83%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_array [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16 [ 83%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_f16_s1688gemm_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so [ 83%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_f16 [ 83%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_h1688gemm.so [ 83%] Built target cutlass_library_gemm_sm75_h1688gemm [ 83%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_h1688gemm_planar_complex.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so [ 83%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex [ 83%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_array [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i88128xorgemm_b1.so [ 83%] Linking 
CUDA shared library libcutlass_gemm_sm75_i8816gemm_s8.so [ 83%] Built target cutlass_library_gemm_sm75_i8816gemm_s8 [ 83%] Built target cutlass_library_gemm_sm75_i88128xorgemm_b1 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i8816gemm_u8.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i8832gemm_s4.so [ 83%] Built target cutlass_library_gemm_sm75_i8816gemm_u8 [ 83%] Built target cutlass_library_gemm_sm75_i8832gemm_s4 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i8832gemm_u4.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s1688gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm75_i8832gemm_u4 [ 83%] Built target cutlass_library_gemm_sm75_s1688gemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16 [ 83%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s4_i8832gemm_s4.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s8_i8816gemm_s8.so [ 83%] Built target cutlass_library_gemm_sm75_s4_i8832gemm_s4 [ 83%] Built target cutlass_library_gemm_sm75_s8_i8816gemm_s8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_u4_i8832gemm_u4.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_u8_i8816gemm_u8.so [ 83%] Built target cutlass_library_gemm_sm75_u4_i8832gemm_u4 [ 83%] Built target cutlass_library_gemm_sm75_u8_i8816gemm_u8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8 [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so [ 83%] Linking CUDA 
shared library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16 [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16 [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_c1688gemm.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_c1688tf32gemm.so [ 83%] Built target cutlass_library_gemm_sm80_c1688gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_cgemm.so [ 83%] Built target cutlass_library_gemm_sm80_c1688tf32gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_d884gemm.so [ 83%] Built target cutlass_library_gemm_sm80_d884gemm [ 83%] Built target cutlass_library_gemm_sm80_cgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_dgemm.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm80_dgemm [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8 [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8 [ 83%] Linking CUDA shared library 
libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16 [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16 [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16832spgemm_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_gz884gemm.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16832spgemm_f16 [ 83%] Built target cutlass_library_gemm_sm80_gz884gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm_grouped.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_grouped [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm_planar_complex.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_array [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm_s8_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16832spgemm.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_s8_f16 [ 83%] Built target cutlass_library_gemm_sm80_h16832spgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i168128spgemm_s4.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i168256andgemm_b1.so [ 83%] Built target cutlass_library_gemm_sm80_i168128spgemm_s4 [ 83%] Built target 
cutlass_library_gemm_sm80_i168256andgemm_b1 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i168256xorgemm_b1.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i16832gemm_s8.so [ 83%] Built target cutlass_library_gemm_sm80_i168256xorgemm_b1 [ 83%] Built target cutlass_library_gemm_sm80_i16832gemm_s8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i16832gemm_u8.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_i16864gemm_s4.so [ 84%] Built target cutlass_library_gemm_sm80_i16832gemm_u8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_i16864gemm_u4.so [ 84%] Built target cutlass_library_gemm_sm80_i16864gemm_s4 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_i16864spgemm_s8.so [ 84%] Built target cutlass_library_gemm_sm80_i16864gemm_u4 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_i16864spgemm_s8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_bf16_s8.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_bf16_u8.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_s8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_u8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_f16_s8.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_f16_u8.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_s8 [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_u8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_grouped_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_bf16 [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_f16 [ 
84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16 [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16 [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_s8_bf16.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_s8_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_bf16 [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_u8_bf16.so [ 84%] Linking CUDA shared library libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_bf16 [ 84%] Built target cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_u8_f16.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816tf32spgemm.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_f16 [ 84%] Built target cutlass_library_gemm_sm80_s16816tf32spgemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16832spgemm_bf16.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16832spgemm_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16832spgemm_bf16 [ 84%] Built target cutlass_library_gemm_sm80_s16832spgemm_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688bf16gemm.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688f16gemm.so [ 84%] Built 
target cutlass_library_gemm_sm80_s1688bf16gemm [ 84%] Built target cutlass_library_gemm_sm80_s1688f16gemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688gemm.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688gemm_tf32.so [ 84%] Built target cutlass_library_gemm_sm80_s1688gemm [ 84%] Built target cutlass_library_gemm_sm80_s1688gemm_tf32 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688tf32gemm.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s4_i168128spgemm_s4.so [ 84%] Built target cutlass_library_gemm_sm80_s1688tf32gemm [ 84%] Built target cutlass_library_gemm_sm80_s4_i168128spgemm_s4 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s4_i16864gemm_s4.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s8_i16832gemm_s8.so [ 84%] Built target cutlass_library_gemm_sm80_s4_i16864gemm_s4 [ 84%] Built target cutlass_library_gemm_sm80_s8_i16832gemm_s8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s8_i16864spgemm_s8.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_sgemm.so [ 84%] Built target cutlass_library_gemm_sm80_s8_i16864spgemm_s8 [ 84%] Built target cutlass_library_gemm_sm80_sgemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_u4_i16864gemm_u4.so [ 84%] Built target cutlass_library_gemm_sm80_u4_i16864gemm_u4 [ 84%] Built target cutlass_library_gemm_sm80_tf32_s1688gemm_tf32 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_u8_i16832gemm_u8.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_z884gemm.so [ 84%] Built target cutlass_library_gemm_sm80_u8_i16832gemm_u8 [ 84%] Built target cutlass_library_gemm_sm80_z884gemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so [ 84%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3 [ 84%] Built target 
cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2 [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so [ 84%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2 [ 84%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3 [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e4m3.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so [ 84%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3 [ 84%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e5m2.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2 [ 85%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so [ 85%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3 [ 85%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3 [ 85%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864spgemm_e4m3.so [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2 [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864spgemm_e5m2.so [ 85%] Linking CUDA shared 
library libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3 [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16 [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2 [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_d1684gemm.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm90_d1684gemm [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_gz1684gemm.so [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3 [ 85%] Built target 
cutlass_library_gemm_sm90_gz1684gemm [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_i64x128x32gemm_s8.so [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_h64x128x16gemm.so [ 86%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_s8 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_i64x128x32gemm_u8.so [ 86%] Built target cutlass_library_gemm_sm90_h64x128x16gemm [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x16gemm_bf16.so [ 86%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_u8 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x16gemm_f16.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_f16 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x8gemm_tf32.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x8tf32gemm.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x8gemm_tf32 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x8tf32gemm [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so [ 86%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8 [ 87%] Linking CUDA shared library libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so [ 87%] Built target 
cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8 [ 87%] Linking CUDA shared library libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so [ 87%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_s8 [ 87%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so [ 87%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_u8 [ 87%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so [ 87%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16 [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_f16 [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3 [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2 [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2 [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_z1684gemm.so [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3 [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so [ 88%] Built target cutlass_library_gemm_sm90_z1684gemm [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so [ 88%] Built target cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32 [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so [ 88%] Built target cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32 [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_sdgrad_optimized.so [ 88%] Built target cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32 [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_sfprop_optimized.so [ 88%] Built 
target cutlass_library_conv2d_sm50_sdgrad_optimized [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_swgrad_optimized.so [ 88%] Built target cutlass_library_conv2d_sm50_sfprop_optimized [ 88%] Linking CUDA shared library libcutlass_conv2d_sm60_hfprop_optimized.so [ 88%] Built target cutlass_library_conv2d_sm50_swgrad_optimized [ 88%] Linking CUDA shared library libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so [ 88%] Built target cutlass_library_conv2d_sm60_hfprop_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_h884dgrad_optimized.so [ 89%] Built target cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_h884fprop_optimized.so [ 89%] Built target cutlass_library_conv2d_sm70_h884dgrad_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_h884wgrad_optimized.so [ 89%] Built target cutlass_library_conv2d_sm70_h884fprop_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm70_h884wgrad_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_s884fprop_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm70_s884dgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm70_s884fprop_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so [ 89%] Built target cutlass_library_conv2d_sm70_s884wgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so [ 89%] 
Built target cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688dgrad_optimized.so [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688fprop_few_channels.so [ 89%] Built target cutlass_library_conv2d_sm75_h1688dgrad_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so [ 89%] Built target cutlass_library_conv2d_sm75_h1688fprop_few_channels [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688fprop_optimized.so [ 89%] Built target cutlass_library_conv2d_sm75_h1688fprop_fixed_channels [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688wgrad_optimized.so [ 89%] Built target cutlass_library_conv2d_sm75_h1688fprop_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so [ 89%] Built target cutlass_library_conv2d_sm75_h1688wgrad_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so [ 89%] Built target 
cutlass_library_conv2d_sm75_i8816fprop_optimized_s8 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so [ 89%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_u8 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so [ 89%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_s4 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_u4 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16 [ 89%] Built target cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16 [ 89%] Built target cutlass_library_conv2d_sm75_s1688fprop_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so [ 89%] Built target cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16 [ 89%] Built target cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so [ 89%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8 [ 89%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so [ 90%] Built target 
cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8 [ 90%] Built target cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so [ 90%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8 [ 90%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so [ 90%] Linking CUDA shared library libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so [ 90%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8 [ 90%] Built target cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so [ 90%] Linking CUDA shared library libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so [ 90%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16 [ 90%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so [ 91%] Built target cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16 [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16 [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16 [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16 [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so [ 91%] Linking CUDA shared library 
libcutlass_conv2d_sm80_h16816dgrad_optimized.so [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16 [ 91%] Built target cutlass_library_conv2d_sm80_h16816dgrad_optimized [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_h16816fprop_optimized.so [ 91%] Built target cutlass_library_conv2d_sm80_h16816fprop_fixed_channels [ 91%] Built target cutlass_library_conv2d_sm80_h16816fprop_optimized [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_h16816wgrad_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so [ 92%] Built target cutlass_library_conv2d_sm80_h16816wgrad_optimized [ 92%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_s8 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so [ 92%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_u8 [ 92%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_s4 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so [ 92%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_u4 [ 92%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16 [ 92%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so [ 92%] Built target 
cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16 [ 92%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_f16 [ 92%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16 [ 92%] Built target cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688bf16fprop_optimized [ 92%] Built target cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688dgrad_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32 [ 92%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688f16fprop_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688f16fprop_optimized [ 92%] Built target cutlass_library_conv2d_sm80_s1688f16dgrad_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688fprop_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized [ 92%] Built target 
cutlass_library_conv2d_sm80_s1688f16wgrad_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32 [ 92%] Built target cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688wgrad_optimized.so [ 93%] Built target cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized [ 93%] Built target cutlass_library_conv2d_sm80_s1688tf32fprop_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so [ 93%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so [ 93%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32 [ 93%] Built target cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so [ 93%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_sdgrad_optimized.so [ 93%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8 [ 93%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_sfprop_optimized.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_swgrad_optimized.so [ 93%] Built target cutlass_library_conv2d_sm80_sdgrad_optimized [ 93%] Linking CUDA 
shared library libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so [ 93%] Built target cutlass_library_conv2d_sm80_swgrad_optimized [ 93%] Built target cutlass_library_conv2d_sm80_sfprop_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so [ 93%] Built target cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so [ 93%] Built target cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32 [ 93%] Built target cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so [ 93%] Built target cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so [ 93%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8 [ 93%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so [ 93%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3 [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3 
[ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so [ 93%] Built target cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16 [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so [ 93%] Built target cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32 [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so [ 93%] Linking CUDA shared library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32 [ 93%] Linking CUDA shared library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so [ 93%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16 [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so [ 94%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so [ 94%] Built target cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16 [ 94%] Built target cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16 [ 94%] Linking CUDA shared library 
libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16 [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16 [ 94%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_analytic [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816fprop3d_optimized.so [ 94%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_optimized [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so [ 95%] Built target cutlass_library_conv3d_sm80_h16816fprop3d_optimized [ 95%] Built target cutlass_library_conv3d_sm80_h16816wgrad3d_optimized [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16 [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16 [ 95%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16 [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16 [ 95%] Linking CUDA shared library 
libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16 [ 95%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16 [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so [ 95%] Linking CUDA shared library libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so [ 96%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16 [ 96%] Built target cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so [ 96%] Built target cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32 [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688herk.so [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32 [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688syrk.so [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32 [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688tf32herk.so [ 96%] Built target 
cutlass_library_rank_k_sm80_c1688herk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688tf32syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_c1688syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_d884syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_c1688tf32herk [ 96%] Built target cutlass_library_rank_k_sm80_c1688tf32syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_gz884herk.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_gz884syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_d884syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_s1688syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_gz884herk [ 96%] Built target cutlass_library_rank_k_sm80_gz884syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_s1688tf32syrk.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_z884herk.so [ 96%] Built target cutlass_library_rank_k_sm80_s1688syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_z884syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_z884herk [ 96%] Built target cutlass_library_rank_k_sm80_s1688tf32syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm90_d1684syrk.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm90_gz1684herk.so [ 96%] Built target cutlass_library_symm_sm90_z1684hemm_objs [ 96%] Linking CUDA shared library libcutlass_rank_k_sm90_gz1684syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_z884syrk [ 97%] Linking CUDA shared library libcutlass_rank_k_sm90_z1684herk.so [ 97%] Built target cutlass_library_rank_k_sm90_gz1684herk [ 97%] Built target cutlass_library_rank_k_sm90_d1684syrk [ 97%] Linking CUDA shared library libcutlass_rank_k_sm90_z1684syrk.so [ 97%] Built target cutlass_library_rank_k_sm90_gz1684syrk [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688her2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688syr2k.so [ 97%] Built target cutlass_library_rank_k_sm90_z1684herk [ 
97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688tf32her2k.so [ 97%] Built target cutlass_library_rank_k_sm90_z1684syrk [ 97%] Built target cutlass_library_rank_2k_sm80_c1688her2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688tf32syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_c1688syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_d884syr2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_gz884her2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_c1688tf32her2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_gz884syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_c1688tf32syr2k [ 97%] Built target cutlass_library_rank_2k_sm80_d884syr2k [ 97%] Built target cutlass_library_rank_2k_sm80_gz884her2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_s1688syr2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_s1688tf32syr2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_z884her2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_gz884syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_z884syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_s1688syr2k [ 97%] Built target cutlass_library_rank_2k_sm80_s1688tf32syr2k [ 97%] Built target cutlass_library_rank_2k_sm80_z884her2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_d1684syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_z884syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_gz1684her2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_gz1684syr2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_z1684her2k.so [ 97%] Built target cutlass_library_rank_2k_sm90_d1684syr2k [ 97%] Built target cutlass_library_rank_2k_sm90_gz1684syr2k [ 97%] Built target cutlass_library_rank_2k_sm90_gz1684her2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_z1684syr2k.so [ 97%] Linking CUDA shared library 
libcutlass_trmm_sm80_c1688tf32trmm.so [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_c1688trmm.so [ 97%] Built target cutlass_library_rank_2k_sm90_z1684her2k [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_d884trmm.so [ 97%] Built target cutlass_library_rank_2k_sm90_z1684syr2k [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_gz884trmm.so [ 97%] Built target cutlass_library_trmm_sm80_d884trmm [ 97%] Built target cutlass_library_trmm_sm80_c1688tf32trmm [ 97%] Built target cutlass_library_trmm_sm80_c1688trmm [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_s1688trmm.so [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_s1688tf32trmm.so [ 98%] Linking CUDA shared library libcutlass_trmm_sm80_z884trmm.so [ 98%] Built target cutlass_library_trmm_sm80_gz884trmm [ 98%] Linking CUDA shared library libcutlass_trmm_sm90_d1684trmm.so [ 98%] Built target cutlass_library_trmm_sm80_s1688tf32trmm [ 98%] Built target cutlass_library_trmm_sm80_s1688trmm [ 98%] Linking CUDA shared library libcutlass_trmm_sm90_gz1684trmm.so [ 98%] Linking CUDA shared library libcutlass_trmm_sm90_z1684trmm.so [ 98%] Built target cutlass_library_trmm_sm80_z884trmm [ 98%] Linking CUDA shared library libcutlass_symm_sm80_c1688hemm.so [ 98%] Built target cutlass_library_trmm_sm90_d1684trmm [ 98%] Linking CUDA shared library libcutlass_symm_sm80_c1688symm.so [ 98%] Built target cutlass_library_trmm_sm90_gz1684trmm [ 98%] Built target cutlass_library_symm_sm80_c1688hemm [ 98%] Built target cutlass_library_trmm_sm90_z1684trmm [ 98%] Linking CUDA shared library libcutlass_symm_sm80_c1688tf32hemm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm80_c1688tf32symm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm80_d884symm.so [ 99%] Built target cutlass_library_symm_sm80_c1688symm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_gz884hemm.so [ 99%] Built target cutlass_library_symm_sm80_c1688tf32hemm [ 99%] Built target 
cutlass_library_symm_sm80_c1688tf32symm [ 99%] Built target cutlass_library_symm_sm80_d884symm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_gz884symm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm80_s1688symm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm80_s1688tf32symm.so [ 99%] Built target cutlass_library_symm_sm80_gz884hemm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_z884hemm.so [ 99%] Built target cutlass_library_symm_sm80_gz884symm [ 99%] Built target cutlass_library_symm_sm80_s1688symm [ 99%] Built target cutlass_library_symm_sm80_s1688tf32symm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_z884symm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm90_d1684symm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm90_gz1684hemm.so [ 99%] Built target cutlass_library_symm_sm80_z884hemm [ 99%] Linking CUDA shared library libcutlass_symm_sm90_gz1684symm.so [ 99%] Built target cutlass_library_symm_sm80_z884symm [ 99%] Built target cutlass_library_symm_sm90_gz1684hemm [ 99%] Built target cutlass_library_symm_sm90_d1684symm [ 99%] Linking CUDA shared library libcutlass_symm_sm90_z1684hemm.so [ 99%] Linking CUDA static library libcutlass_symm_sm90_z1684hemm.a [ 99%] Built target cutlass_library_symm_sm90_z1684hemm_static [ 99%] Linking CXX static library libcutlass.a [ 99%] Built target cutlass_library_symm_sm90_gz1684symm [ 99%] Built target cutlass_library_symm_sm90_z1684hemm [ 99%] Linking CXX shared library libcutlass.so [ 99%] Built target cutlass_library_static [ 99%] Built target cutlass_library [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/main.cpp.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/options.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cutlass_profiler.cu.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/performance_report.cpp.o [ 99%] Building CXX 
object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/enumerated_types.cpp.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/gpu_timer.cpp.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/device_allocation.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/device_context.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cublas_helpers.cu.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cudnn_helpers.cpp.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/problem_space.cpp.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/gemm_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/rank_k_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/rank_2k_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/trmm_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/symm_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/conv2d_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/conv3d_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/sparse_gemm_operation_profiler.cu.o [100%] Linking CXX executable cutlass_profiler [100%] Built target cutlass_profiler + popd ~/build/BUILD/cutlass + exit 0 Executing(%install): /bin/sh -e /var/tmp/rpm-tmp.4fypzs + umask 022 + cd /builddir/build/BUILD + '[' /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64 '!=' / ']' + rm -rf 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64 ++ dirname /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64 + mkdir -p /builddir/build/BUILDROOT + mkdir /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64 + cd cutlass + rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64 + pushd build ~/build/BUILD/cutlass/build ~/build/BUILD/cutlass + DESTDIR=/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64 + /usr/bin/cmake --install . -- Install configuration: "Release" -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/functional.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/functional.h.fp16~ -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/workspace.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/wmma_array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/version.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/uint128.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/warp/vector_fragment_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/vector_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/regular_tile_iterator_tensor_op_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/regular_tile_iterator_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/regular_tile_iterator_pitch_linear_2dthreadtile.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/regular_tile_iterator_pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/regular_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_tensor_op_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_pitch_linear_direct_conv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/regular_scale_bias_vector_access_iterator.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/predicated_vector_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/predicated_tile_iterator_triangular_matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/predicated_tile_iterator_2dthreadtile.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/predicated_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator_triangular_matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator_2dthreadtile.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/predicated_scale_bias_vector_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/predicated_scale_bias_vector_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/ell_predicated_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/ell_predicated_tile_access_iterator.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/threadblock/ell_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/thread/unary_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/thread/transpose.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/pitch_linear_thread_map.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/collective -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/transform/collective/sm90_wgmma_transpose.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/trace.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/thread/matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/tfloat32.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/tensor_view_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/tensor_view.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/tensor_ref_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/tensor_ref.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/tensor_coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/subbyte_reference.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/semaphore.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/relatively_equal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/threadblock_swizzle.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/thread/reduction_operators.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/thread/reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/kernel -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/kernel/tensor_reduce_affine_strided.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/kernel/tensor_reduce_affine_contiguous.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/kernel/reduce_split_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/kernel/reduce_softmax_final.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/device -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/device/tensor_reduce_affine_strided.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/device/tensor_reduce_affine_contiguous.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/device/tensor_reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/reduction/device/reduce_split_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/real.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/quaternion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/predicate_vector.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/platform -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/platform/platform.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/pitch_linear_coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/pipeline -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/pipeline/sm90_pipeline.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/pipeline/pipeline.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/numeric_types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/numeric_size.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/numeric_conversion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/matrix_shape.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/matrix_coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/layout -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/layout/vector.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/layout/tensor_op_multiplicand_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/layout/tensor_op_multiplicand_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/layout/tensor_op_multiplicand_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/layout/tensor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/layout/pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/layout/permute.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/layout/matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/layout/layout.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/kernel_launch.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/kernel_hardware_info.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/kernel_hardware_info.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/integer_subbyte.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/half.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm_coord.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm_coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/tile_iterator_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/softmax_scale_bias_transform.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/scale_bias_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_with_reduction_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op_wmma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_wmma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sparse.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sm80.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op_fragment_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op_fast_f32.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_sparse_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_simt_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_simt_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_mixed_input_tensor_op.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_gaussian_complex_tensor_op_tile_iterator_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_gaussian_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_complex_tensor_op_tile_iterator_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_complex_tensor_op_fast_f32.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/layernorm_scale_bias_transform.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/default_mma_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/default_mma_with_reduction_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/default_mma_tensor_op_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/default_mma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/default_mma_sparse_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/warp/default_mma_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/threadblock_swizzle_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/threadblock_swizzle.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_with_reduction_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_sparse_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_sparse_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_softmax_mainloop_fusion_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_singlestage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_planar_complex_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_planar_complex_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_planar_complex_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_layernorm_mainloop_fusion_multistage.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_blas3_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/mma_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/index_remat.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/ell_mma_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/ell_mma_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_trmm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_sparse_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_multistage_trmm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_multistage_mma_complex_core_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_multistage_mma_complex_core.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_multistage_mma_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_with_reduction.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_softmax_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_planar_complex_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_planar_complex_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_core_wmma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_core_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_core_with_access_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_core_sparse_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_core_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_core_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_core_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_core_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma_core.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_gemv_core.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/threadblock/default_ell_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/thread/mma_sm61.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/thread/mma_sm60.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/thread/mma_sm50.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/thread/mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/trmm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/tile_scheduler_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/tile_scheduler.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/symm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/static_tile_scheduler.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sparse_gemm_with_visitor.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sparse_gemm_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sparse_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm90_tile_scheduler_stream_k.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm90_tile_scheduler_group.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized_pingpong.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized_cooperative.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm90_gemm_tma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/sm70_gemm.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/rank_k_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/rank_2k_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/rank_2k_transpose_operands.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/rank_2k_grouped_problem_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/rank_2k_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/params_universal_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/params_sparse_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/grouped_problem_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemv_batched_strided.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_with_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_with_fused_epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_with_absmax.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_universal_with_visitor_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_universal_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_universal_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_universal.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_transpose_operands.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_streamk_with_fused_epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_splitk_parallel.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_planar_complex_array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_grouped_softmax_mainloop_fusion.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_grouped_problem_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_batched.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm_array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/ell_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_trmm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_trmm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_trmm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_symm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_symm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_symm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_rank_k_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_rank_k_complex.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_rank_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_rank_2k_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_rank_2k_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_rank_2k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_rank_2k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_with_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_universal_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_streamk_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_splitk_parallel.h -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_sparse_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_sparse_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_sparse.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_planar_complex_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_grouped_softmax_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/kernel/default_ell_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/group_array_problem_shape.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/gemm_enumerated_types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/dispatch_policy.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/trmm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/symm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/rank_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/rank_2k_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/rank_2k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_with_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_universal_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_universal_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_universal_streamk_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_universal_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_universal_adapter.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_splitk_parallel.h -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_sparse_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_sparse_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_sparse.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_batched.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm_array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/ell_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/default_gemm_configuration.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/device/base_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_ss_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_rs_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/sm80_mma_multistage.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/sm70_mma_twostage.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/fp8_accumulation.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/collective_mma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/collective_builder.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/builders -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/builders/sm90_gmma_builder.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/gemm/collective/builders/sm90_common.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/floating_point_nvrtc.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/float8.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/fast_math.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/wmma_tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/volta_tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/tile_iterator_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/tile_iterator_volta_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/tile_iterator_tensor_op_mixed.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/tile_iterator_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/tile_iterator_simt.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/simt_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/fragment_iterator_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/fragment_iterator_volta_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/fragment_iterator_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/fragment_iterator_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/fragment_iterator_gaussian_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/warp/fragment_iterator_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/shared_load_iterator_pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/shared_load_iterator_mixed.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/shared_load_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_strided_dgrad.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_predicates.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_direct_conv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_conv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_affine_layout_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_affine.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/output_tile_thread_map.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/output_iterator_parameter.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/interleaved_epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/fusion -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/fusion/visitors.hpp 
-- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/fusion/visitor_store.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/fusion/visitor_load.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/fusion/visitor_compute.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/fusion/visitor_2x.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_workspace.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_with_visitor_callbacks.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_visitor_with_softmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_streamk_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_smem_accumulator.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_gemm_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_direct_store.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_depthwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_base_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/direct_store_epilogue_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_thread_map_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_thread_map_volta_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_thread_map_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_thread_map_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_wmma_tensor_op.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_volta_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op_blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_direct_store.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_complex_tensor_op_blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/threadblock/default_epilogue_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/scale_type.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/reduction_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_with_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_tensor_broadcast.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_silu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_sigmoid.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_residual_block.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_relu0.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_relu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_leaky_relu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_hardswish.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_generic_with_scaling.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_generic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_gelu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_drelu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_dgelu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_clamp.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_bias_relu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination_bias_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/linear_combination.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/detail.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/conversion_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/thread/activation.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/fusion -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/fusion/operations.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/fusion/callbacks.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/dispatch_policy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized_bias_elementwise.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective/epilogue_tensor_broadcast.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective/detail.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective/default_epilogue_array.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective/default_epilogue.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective/collective_epilogue.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective/collective_builder.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective/builders -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/epilogue/collective/builders/sm90_builder.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/device_kernel.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/detail -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/detail/mma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/detail/layout.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/detail/helper_macros.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/detail/dependent_false.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/detail/collective.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/cutlass.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/cuda_host_adapter.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/core_io.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/warp/scale_bias_relu_transform.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/warp/mma_depthwise_simt_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/warp/mma_depthwise_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/threadblock_swizzle.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/predicated_scale_bias_vector_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/predicated_scale_bias_vector_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/implicit_gemm_wgrad_fusion_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/implicit_gemm_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/implicit_gemm_multistage.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/implicit_gemm_fprop_fusion_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/depthwise_mma_core_with_lane_access_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/depthwise_mma_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/depthwise_fprop_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/depthwise_fprop_filter_tile_access_iterator_direct_conv_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/depthwise_fprop_direct_conv_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/depthwise_fprop_activation_tile_access_iterator_direct_conv_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/depthwise_fprop_activation_tile_access_iterator_direct_conv_fixed_stride_dilation.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/depthwise_direct_conv_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_wgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_wgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_wgrad_activation_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_wgrad_activation_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_fprop_filter_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_fprop_filter_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_dgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_dgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_dgrad_filter_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv3d_dgrad_filter_tile_access_iterator_analytic.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_wgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_wgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_wgrad_activation_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_wgrad_activation_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_fixed_channels.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_few_channels.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_optimized.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_fixed_channels.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_few_channels.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_dgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_dgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_dgrad_filter_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/threadblock/conv2d_dgrad_filter_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/thread/depthwise_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_with_fused_epilogue.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_strided_dgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/implicit_gemm_convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/direct_convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_depthwise_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_deconv3d_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_deconv3d.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_deconv2d_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_deconv2d.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv3d_wgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv3d_fprop_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv3d_fprop_fusion.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv3d_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv3d_dgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv2d_wgrad_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv2d_wgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv2d_group_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv2d_fprop_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv2d_fprop_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv2d_fprop_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv2d_fprop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv2d_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv2d_dgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/default_conv2d.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/kernel/conv_universal.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/dispatch_policy.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/device -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/device/implicit_gemm_convolution_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/device/implicit_gemm_convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/device/direct_convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/device/conv_universal_adapter.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/convnd_problem_shape.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/conv3d_problem_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/conv2d_problem_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/collective -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/collective/sm90_implicit_gemm_gmma_ss_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/collective/detail.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/collective/collective_conv.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/collective/collective_builder.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/collective/builders -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/collective/builders/sm90_gmma_builder.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/conv/collective/builders/sm90_common.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/constants.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/cluster_launch.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/block_striped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/blas3_types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/bfloat16.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/barrier.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/array_subbyte.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/array_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/wmma_sm75.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/wmma_sm72.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/wmma_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/wmma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/simd_sm61.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/simd_sm60.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/simd.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/reg_reconfig.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/mma_sparse_sm89.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/mma_sparse_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/mma_sm90.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/mma_sm89.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/mma_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/mma_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/mma_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/mma_sm61.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/mma_sm60.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/mma_sm50.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/memory_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/memory_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/memory.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/cache_operation.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/barrier.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/arch/arch.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/aligned_buffer.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/util -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/util/type_traits.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/util/print.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/util/debug.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/underscore.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/tensor_predicate.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/tensor.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/swizzle_layout.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/swizzle.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/stride.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/pointer_swizzle.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/pointer_flagged.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/pointer_base.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/pointer.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/numeric -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/numeric/real.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/numeric/numeric_types.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/numeric/math.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/numeric/integral_ratio.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/numeric/integral_constant.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/numeric/integer_sequence.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/numeric/int.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/numeric/complex.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/numeric/arithmetic_tuple.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/layout_composed.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/layout.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/int_tuple.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/container -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/container/type_list.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/container/tuple.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/container/cuda_types.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/container/bit_field.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/container/array_subbyte.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/container/array_aligned.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/container/array.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/container/alignment.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/config.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/mma_traits_sm90_gmma.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/mma_traits_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/mma_traits_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/mma_traits_sm75.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/mma_traits_sm70.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/mma_traits_sm61.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/mma_traits.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/mma_atom.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/copy_traits_sm90_tma_swizzle.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/copy_traits_sm90_tma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/copy_traits_sm90_im2col.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/copy_traits_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/copy_traits_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/copy_traits_sm75.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/copy_traits_sm50.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/copy_traits.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/atom/copy_atom.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/util.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/mma_sm90_gmma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/mma_sm90_desc.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/mma_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/mma_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/mma_sm75.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/mma_sm70.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/mma_sm61.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/mma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/copy_sm90_tma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/copy_sm90_desc.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/copy_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/copy_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/copy_sm75.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/copy_sm50.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/copy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/arch/cluster_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/tuple_algorithms.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/tensor_algorithms.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/prefetch.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/prefer.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/gemm.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/functional.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/fill.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/copy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/cooperative_gemm.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/cooperative_copy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/clear.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cute/algorithm/axpby.hpp -- Up-to-date: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/cutlass/version_extended.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/test/cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/test/cutlass/bin -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/test/cutlass/lib64 -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/test/cutlass/ctest -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/ -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/type_traits.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/tensor_view_io.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/trmm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/trmm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/tensor_reduce.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/tensor_reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/tensor_norm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/tensor_foreach.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/tensor_fill.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/tensor_fill.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/tensor_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/tensor_copy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/tensor_compare.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/tensor_compare.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/symm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/symm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/rank_k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/rank_2k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/rank_2k.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/gett.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/gemm_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/gemm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/error_metrics.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/host/conv.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/thread/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/tensor_relu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/tensor_reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/tensor_foreach.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/tensor_fill.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/tensor_compare.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/rank_2k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/kernel -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/kernel/tensor_foreach.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/kernel/tensor_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/kernel/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/gett.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/gemm_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/gemm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/device/convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/detail -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/detail/linear_to_coordinate.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/reference/detail/inner_product.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/print_error.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/packed_stride.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/index_sequence.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/host_uncompress.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/host_tensor_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/host_tensor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/host_reorder.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/helper_cuda.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/gett_commandline.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/exceptions.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/distribution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/device_utils.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/device_rmsnorm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/device_nhwc_to_nchw.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/device_nhwc_pooling.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/device_nhwc_padding.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/device_nchw_to_nhwc.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/device_memory.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/device_layernorm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/device_groupnorm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/device_dump.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/debug.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/cublas_wrappers.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/command_line.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/util/GPU_Clock.hpp -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include/ -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/library -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/library/util.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/library/types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/library/singleton.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/library/operation_table.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/library/manifest.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/library/library.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/library/handle.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/library/descriptions.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/include//cutlass/library/arch_mappings.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm50_cgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm50_cgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm50_dgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm50_dgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm50_sgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm50_sgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm60_hgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm60_hgemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm61_igemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm61_igemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm61_s8_igemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm61_s8_igemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_h884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_h884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex_array.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex_array.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_s884gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_s884gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_h1688gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_h1688gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex_array.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i88128xorgemm_b1.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i88128xorgemm_b1.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_s4.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s4_i8832gemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s4_i8832gemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s8_i8816gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s8_i8816gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_u4_i8832gemm_u4.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_u4_i8832gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_u8_i8816gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_u8_i8816gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_c1688gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_c1688gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_c1688tf32gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_c1688tf32gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_cgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_cgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_d884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_d884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_dgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_dgemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.a -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16832spgemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16832spgemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_gz884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_gz884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_grouped.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_grouped.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex_array.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_s8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_s8_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16832spgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16832spgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i168128spgemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i168128spgemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i168256andgemm_b1.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i168256andgemm_b1.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i168256xorgemm_b1.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i168256xorgemm_b1.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_u4.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16864spgemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16864spgemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_u8.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816tf32spgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816tf32spgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688bf16gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688bf16gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688f16gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688f16gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688gemm_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688gemm_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688tf32gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688tf32gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s4_i168128spgemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s4_i168128spgemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s4_i16864gemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s4_i16864gemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s8_i16832gemm_s8.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s8_i16832gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s8_i16864spgemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s8_i16864spgemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_sgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_sgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_tf32_s1688gemm_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_u4_i16864gemm_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_u4_i16864gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_u8_i16832gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_u8_i16832gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_z884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_z884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.a -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_d1684gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_d1684gemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_gz1684gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_gz1684gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_h64x128x16gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_h64x128x16gemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x8gemm_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x8gemm_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x8tf32gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x8tf32gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_u8.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_z1684gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_z1684gemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_sdgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_sdgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_sfprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_sfprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_swgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_swgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm60_hfprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm60_hfprop_optimized.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_h884dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_h884dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_h884fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_h884fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_h884wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_h884wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_s884dgrad_optimized_f16.a -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_s884fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_s884fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_s884wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_few_channels.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_few_channels.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_fixed_channels.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_fixed_channels.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so 
-- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688f16dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688f16fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688f16fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688f16wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_sdgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_sdgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_sfprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_sfprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_swgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_swgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_analytic.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816fprop3d_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816fprop3d_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816wgrad3d_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32herk.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_d884syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_d884syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_gz884herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_gz884herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_gz884syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_gz884syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_s1688syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_s1688syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_s1688tf32syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_s1688tf32syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_z884herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_z884herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_z884syrk.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_z884syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_d1684syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_d1684syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_gz1684herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_gz1684herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_gz1684syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_gz1684syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_z1684herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_z1684herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_z1684syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_z1684syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688syr2k.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_d884syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_d884syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_gz884her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_gz884her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_gz884syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_gz884syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_s1688syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_s1688syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_s1688tf32syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_s1688tf32syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_z884her2k.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_z884her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_z884syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_z884syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_d1684syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_d1684syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_gz1684her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_gz1684her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_gz1684syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_gz1684syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_z1684her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_z1684her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_z1684syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_z1684syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_c1688tf32trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_c1688tf32trmm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_c1688trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_c1688trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_d884trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_d884trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_gz884trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_gz884trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_s1688tf32trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_s1688tf32trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_s1688trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_s1688trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_z884trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_z884trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm90_d1684trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm90_d1684trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm90_gz1684trmm.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm90_gz1684trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm90_z1684trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm90_z1684trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688tf32hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688tf32hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688tf32symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688tf32symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_d884symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_d884symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_gz884hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_gz884hemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_gz884symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_gz884symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_s1688symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_s1688symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_s1688tf32symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_s1688tf32symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_z884hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_z884hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_z884symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_z884symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_d1684symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_d1684symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_gz1684hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_gz1684hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_gz1684symm.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_gz1684symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_z1684hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_z1684hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_z1684symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_z1684symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/share/info/cutlass/generated_kernels.txt -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/bin/cutlass_profiler -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/test/cutlass/ctest/ctest_profiler/CTestTestfile.ctest_profiler.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/test/cutlass/CTestTestfile.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassConfig.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassConfigVersion.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassTargets.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassTargets-release.cmake + popd ~/build/BUILD/cutlass + rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/test + rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/share/info + set +x Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/bin/cutlass_profiler Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_sdgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_sfprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm50_swgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm60_hfprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_h884dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_h884fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_h884wgrad_optimized.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_s884fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_few_channels.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_h1688wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_h16816wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688f16fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_sdgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_sfprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_swgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816fprop3d_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm50_cgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm50_dgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm50_sgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm60_hgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm61_igemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm61_s8_igemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_h884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex_array.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_s884gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_h1688gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i88128xorgemm_b1.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s4_i8832gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_s8_i8816gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_u4_i8832gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm75_u8_i8816gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_c1688gemm.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_c1688tf32gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_cgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_d884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_dgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_f16_s16832spgemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_gz884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_grouped.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_s8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_h16832spgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i168128spgemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i168256andgemm_b1.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i168256xorgemm_b1.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_i16864spgemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_u8.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16816tf32spgemm.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688bf16gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688f16gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688gemm_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s1688tf32gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s4_i168128spgemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s4_i16864gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s8_i16832gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_s8_i16864spgemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_sgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_u4_i16864gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_u8_i16832gemm_u8.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm80_z884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_d1684gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_gz1684gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_h64x128x16gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x8gemm_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s64x128x8tf32gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_gemm_sm90_z1684gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_d884syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_gz884her2k.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_gz884syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_s1688syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_s1688tf32syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_z884her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm80_z884syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_d1684syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_gz1684her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_gz1684syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_z1684her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_2k_sm90_z1684syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_d884syrk.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_gz884herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_gz884syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_s1688syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_s1688tf32syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_z884herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm80_z884syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_d1684syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_gz1684herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_gz1684syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_z1684herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_rank_k_sm90_z1684syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688tf32hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_c1688tf32symm.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_d884symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_gz884hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_gz884symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_s1688symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_s1688tf32symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_z884hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm80_z884symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_d1684symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_gz1684hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_gz1684symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_z1684hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_symm_sm90_z1684symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_c1688tf32trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_c1688trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_d884trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_gz884trmm.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_s1688tf32trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_s1688trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm80_z884trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm90_d1684trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm90_gz1684trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/lib64/libcutlass_trmm_sm90_z1684trmm.so + /usr/lib/rpm/check-buildroot + /usr/lib/rpm/redhat/brp-ldconfig /sbin/ldconfig: Warning: ignoring configuration file that cannot be opened: /etc/ld.so.conf: No such file or directory + /usr/lib/rpm/brp-compress + /usr/lib/rpm/brp-strip /usr/bin/strip + /usr/lib/rpm/brp-strip-comment-note /usr/bin/strip /usr/bin/objdump + /usr/lib/rpm/brp-strip-static-archive /usr/bin/strip + /usr/lib/rpm/brp-python-bytecompile '' 1 + /usr/lib/rpm/brp-python-hardlink + PYTHON3=/usr/bin/python3.6 + /usr/lib/rpm/redhat/brp-mangle-shebangs Processing files: cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64 Executing(%doc): /bin/sh -e /var/tmp/rpm-tmp.WB8fUR + umask 022 + cd /builddir/build/BUILD + cd cutlass + DOCDIR=/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/share/doc/cutlass + export LC_ALL=C + LC_ALL=C + export DOCDIR + /usr/bin/mkdir -p /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/share/doc/cutlass + cp -pr README.md /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/share/doc/cutlass + cp -pr docs /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/share/doc/cutlass + exit 0 Executing(%license): /bin/sh -e /var/tmp/rpm-tmp.SH3GfJ + umask 022 + cd 
/builddir/build/BUILD + cd cutlass + LICENSEDIR=/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/share/licenses/cutlass + export LC_ALL=C + LC_ALL=C + export LICENSEDIR + /usr/bin/mkdir -p /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/share/licenses/cutlass + cp -pr LICENSE.txt /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64/usr/share/licenses/cutlass + exit 0 Provides: cutlass = 3.5.0-20240411.1.cu12_4.el8 cutlass(x86-64) = 3.5.0-20240411.1.cu12_4.el8 libcutlass.so()(64bit) libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm50_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm50_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm60_hfprop_optimized.so()(64bit) libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_h884dgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_h884fprop_optimized.so()(64bit) libcutlass_conv2d_sm70_h884wgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so()(64bit) 
libcutlass_conv2d_sm75_h1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_few_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_h16816dgrad_optimized.so()(64bit) 
libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm80_h16816fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_h16816wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so()(64bit) 
libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm80_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so()(64bit) libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so()(64bit) 
libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816fprop3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_gemm_sm50_cgemm.so()(64bit) libcutlass_gemm_sm50_dgemm.so()(64bit) libcutlass_gemm_sm50_sgemm.so()(64bit) libcutlass_gemm_sm60_hgemm.so()(64bit) libcutlass_gemm_sm61_igemm_s8.so()(64bit) libcutlass_gemm_sm61_s8_igemm_s8.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm70_h884gemm.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm70_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so()(64bit) 
libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_h1688gemm.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm75_i88128xorgemm_b1.so()(64bit) libcutlass_gemm_sm75_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm75_i8832gemm_s4.so()(64bit) libcutlass_gemm_sm75_i8832gemm_u4.so()(64bit) libcutlass_gemm_sm75_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_s4_i8832gemm_s4.so()(64bit) libcutlass_gemm_sm75_s8_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_u4_i8832gemm_u4.so()(64bit) libcutlass_gemm_sm75_u8_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so()(64bit) libcutlass_gemm_sm80_c1688gemm.so()(64bit) libcutlass_gemm_sm80_c1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_cgemm.so()(64bit) libcutlass_gemm_sm80_d884gemm.so()(64bit) libcutlass_gemm_sm80_dgemm.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so()(64bit) 
libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_gz884gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm_grouped.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm80_h16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_h16832spgemm.so()(64bit) libcutlass_gemm_sm80_i168128spgemm_s4.so()(64bit) libcutlass_gemm_sm80_i168256andgemm_b1.so()(64bit) libcutlass_gemm_sm80_i168256xorgemm_b1.so()(64bit) libcutlass_gemm_sm80_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_i16864gemm_u4.so()(64bit) libcutlass_gemm_sm80_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_s16816tf32spgemm.so()(64bit) libcutlass_gemm_sm80_s16832spgemm_bf16.so()(64bit) 
libcutlass_gemm_sm80_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_s1688bf16gemm.so()(64bit) libcutlass_gemm_sm80_s1688f16gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_s1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_s4_i168128spgemm_s4.so()(64bit) libcutlass_gemm_sm80_s4_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_s8_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_s8_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_sgemm.so()(64bit) libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_u4_i16864gemm_u4.so()(64bit) libcutlass_gemm_sm80_u8_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_z884gemm.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_d1684gemm.so()(64bit) 
libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_gz1684gemm.so()(64bit) libcutlass_gemm_sm90_h64x128x16gemm.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_s64x128x8gemm_tf32.so()(64bit) libcutlass_gemm_sm90_s64x128x8tf32gemm.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_z1684gemm.so()(64bit) libcutlass_rank_2k_sm80_c1688her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32syr2k.so()(64bit) libcutlass_rank_2k_sm80_d884syr2k.so()(64bit) libcutlass_rank_2k_sm80_gz884her2k.so()(64bit) libcutlass_rank_2k_sm80_gz884syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688tf32syr2k.so()(64bit) 
libcutlass_rank_2k_sm80_z884her2k.so()(64bit) libcutlass_rank_2k_sm80_z884syr2k.so()(64bit) libcutlass_rank_2k_sm90_d1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684her2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_z1684her2k.so()(64bit) libcutlass_rank_2k_sm90_z1684syr2k.so()(64bit) libcutlass_rank_k_sm80_c1688herk.so()(64bit) libcutlass_rank_k_sm80_c1688syrk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32herk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_d884syrk.so()(64bit) libcutlass_rank_k_sm80_gz884herk.so()(64bit) libcutlass_rank_k_sm80_gz884syrk.so()(64bit) libcutlass_rank_k_sm80_s1688syrk.so()(64bit) libcutlass_rank_k_sm80_s1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_z884herk.so()(64bit) libcutlass_rank_k_sm80_z884syrk.so()(64bit) libcutlass_rank_k_sm90_d1684syrk.so()(64bit) libcutlass_rank_k_sm90_gz1684herk.so()(64bit) libcutlass_rank_k_sm90_gz1684syrk.so()(64bit) libcutlass_rank_k_sm90_z1684herk.so()(64bit) libcutlass_rank_k_sm90_z1684syrk.so()(64bit) libcutlass_symm_sm80_c1688hemm.so()(64bit) libcutlass_symm_sm80_c1688symm.so()(64bit) libcutlass_symm_sm80_c1688tf32hemm.so()(64bit) libcutlass_symm_sm80_c1688tf32symm.so()(64bit) libcutlass_symm_sm80_d884symm.so()(64bit) libcutlass_symm_sm80_gz884hemm.so()(64bit) libcutlass_symm_sm80_gz884symm.so()(64bit) libcutlass_symm_sm80_s1688symm.so()(64bit) libcutlass_symm_sm80_s1688tf32symm.so()(64bit) libcutlass_symm_sm80_z884hemm.so()(64bit) libcutlass_symm_sm80_z884symm.so()(64bit) libcutlass_symm_sm90_d1684symm.so()(64bit) libcutlass_symm_sm90_gz1684hemm.so()(64bit) libcutlass_symm_sm90_gz1684symm.so()(64bit) libcutlass_symm_sm90_z1684hemm.so()(64bit) libcutlass_symm_sm90_z1684symm.so()(64bit) libcutlass_trmm_sm80_c1688tf32trmm.so()(64bit) libcutlass_trmm_sm80_c1688trmm.so()(64bit) libcutlass_trmm_sm80_d884trmm.so()(64bit) libcutlass_trmm_sm80_gz884trmm.so()(64bit) libcutlass_trmm_sm80_s1688tf32trmm.so()(64bit) 
libcutlass_trmm_sm80_s1688trmm.so()(64bit) libcutlass_trmm_sm80_z884trmm.so()(64bit) libcutlass_trmm_sm90_d1684trmm.so()(64bit) libcutlass_trmm_sm90_gz1684trmm.so()(64bit) libcutlass_trmm_sm90_z1684trmm.so()(64bit) Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1 Requires: ld-linux-x86-64.so.2()(64bit) ld-linux-x86-64.so.2(GLIBC_2.3)(64bit) libc.so.6()(64bit) libc.so.6(GLIBC_2.14)(64bit) libc.so.6(GLIBC_2.2.5)(64bit) libcuda.so.1()(64bit) libcudart.so.12()(64bit) libcudart.so.12(libcudart.so.12)(64bit) libcutlass.so()(64bit) libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm50_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm50_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm60_hfprop_optimized.so()(64bit) libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_h884dgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_h884fprop_optimized.so()(64bit) libcutlass_conv2d_sm70_h884wgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so()(64bit) 
libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_h1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_few_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so()(64bit) 
libcutlass_conv2d_sm80_h16816dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm80_h16816fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_h16816wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so()(64bit) 
libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm80_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so()(64bit) libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so()(64bit) 
libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816fprop3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_gemm_sm50_cgemm.so()(64bit) libcutlass_gemm_sm50_dgemm.so()(64bit) libcutlass_gemm_sm50_sgemm.so()(64bit) libcutlass_gemm_sm60_hgemm.so()(64bit) libcutlass_gemm_sm61_igemm_s8.so()(64bit) libcutlass_gemm_sm61_s8_igemm_s8.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm70_h884gemm.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm70_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so()(64bit) 
libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_h1688gemm.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm75_i88128xorgemm_b1.so()(64bit) libcutlass_gemm_sm75_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm75_i8832gemm_s4.so()(64bit) libcutlass_gemm_sm75_i8832gemm_u4.so()(64bit) libcutlass_gemm_sm75_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_s4_i8832gemm_s4.so()(64bit) libcutlass_gemm_sm75_s8_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_u4_i8832gemm_u4.so()(64bit) libcutlass_gemm_sm75_u8_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so()(64bit) libcutlass_gemm_sm80_c1688gemm.so()(64bit) libcutlass_gemm_sm80_c1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_cgemm.so()(64bit) libcutlass_gemm_sm80_d884gemm.so()(64bit) libcutlass_gemm_sm80_dgemm.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so()(64bit) 
libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_gz884gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm_grouped.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm80_h16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_h16832spgemm.so()(64bit) libcutlass_gemm_sm80_i168128spgemm_s4.so()(64bit) libcutlass_gemm_sm80_i168256andgemm_b1.so()(64bit) libcutlass_gemm_sm80_i168256xorgemm_b1.so()(64bit) libcutlass_gemm_sm80_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_i16864gemm_u4.so()(64bit) libcutlass_gemm_sm80_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_s16816tf32spgemm.so()(64bit) libcutlass_gemm_sm80_s16832spgemm_bf16.so()(64bit) 
libcutlass_gemm_sm80_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_s1688bf16gemm.so()(64bit) libcutlass_gemm_sm80_s1688f16gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_s1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_s4_i168128spgemm_s4.so()(64bit) libcutlass_gemm_sm80_s4_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_s8_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_s8_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_sgemm.so()(64bit) libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_u4_i16864gemm_u4.so()(64bit) libcutlass_gemm_sm80_u8_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_z884gemm.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_d1684gemm.so()(64bit) 
libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_gz1684gemm.so()(64bit) libcutlass_gemm_sm90_h64x128x16gemm.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_s64x128x8gemm_tf32.so()(64bit) libcutlass_gemm_sm90_s64x128x8tf32gemm.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_z1684gemm.so()(64bit) libcutlass_rank_2k_sm80_c1688her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32syr2k.so()(64bit) libcutlass_rank_2k_sm80_d884syr2k.so()(64bit) libcutlass_rank_2k_sm80_gz884her2k.so()(64bit) libcutlass_rank_2k_sm80_gz884syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688tf32syr2k.so()(64bit) 
libcutlass_rank_2k_sm80_z884her2k.so()(64bit) libcutlass_rank_2k_sm80_z884syr2k.so()(64bit) libcutlass_rank_2k_sm90_d1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684her2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_z1684her2k.so()(64bit) libcutlass_rank_2k_sm90_z1684syr2k.so()(64bit) libcutlass_rank_k_sm80_c1688herk.so()(64bit) libcutlass_rank_k_sm80_c1688syrk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32herk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_d884syrk.so()(64bit) libcutlass_rank_k_sm80_gz884herk.so()(64bit) libcutlass_rank_k_sm80_gz884syrk.so()(64bit) libcutlass_rank_k_sm80_s1688syrk.so()(64bit) libcutlass_rank_k_sm80_s1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_z884herk.so()(64bit) libcutlass_rank_k_sm80_z884syrk.so()(64bit) libcutlass_rank_k_sm90_d1684syrk.so()(64bit) libcutlass_rank_k_sm90_gz1684herk.so()(64bit) libcutlass_rank_k_sm90_gz1684syrk.so()(64bit) libcutlass_rank_k_sm90_z1684herk.so()(64bit) libcutlass_rank_k_sm90_z1684syrk.so()(64bit) libcutlass_symm_sm80_c1688hemm.so()(64bit) libcutlass_symm_sm80_c1688symm.so()(64bit) libcutlass_symm_sm80_c1688tf32hemm.so()(64bit) libcutlass_symm_sm80_c1688tf32symm.so()(64bit) libcutlass_symm_sm80_d884symm.so()(64bit) libcutlass_symm_sm80_gz884hemm.so()(64bit) libcutlass_symm_sm80_gz884symm.so()(64bit) libcutlass_symm_sm80_s1688symm.so()(64bit) libcutlass_symm_sm80_s1688tf32symm.so()(64bit) libcutlass_symm_sm80_z884hemm.so()(64bit) libcutlass_symm_sm80_z884symm.so()(64bit) libcutlass_symm_sm90_d1684symm.so()(64bit) libcutlass_symm_sm90_gz1684hemm.so()(64bit) libcutlass_symm_sm90_gz1684symm.so()(64bit) libcutlass_symm_sm90_z1684hemm.so()(64bit) libcutlass_symm_sm90_z1684symm.so()(64bit) libcutlass_trmm_sm80_c1688tf32trmm.so()(64bit) libcutlass_trmm_sm80_c1688trmm.so()(64bit) libcutlass_trmm_sm80_d884trmm.so()(64bit) libcutlass_trmm_sm80_gz884trmm.so()(64bit) libcutlass_trmm_sm80_s1688tf32trmm.so()(64bit) 
libcutlass_trmm_sm80_s1688trmm.so()(64bit) libcutlass_trmm_sm80_z884trmm.so()(64bit) libcutlass_trmm_sm90_d1684trmm.so()(64bit) libcutlass_trmm_sm90_gz1684trmm.so()(64bit) libcutlass_trmm_sm90_z1684trmm.so()(64bit) libgcc_s.so.1()(64bit) libgcc_s.so.1(GCC_3.0)(64bit) libm.so.6()(64bit) libm.so.6(GLIBC_2.2.5)(64bit) libstdc++.so.6()(64bit) libstdc++.so.6(CXXABI_1.3)(64bit) libstdc++.so.6(CXXABI_1.3.5)(64bit) libstdc++.so.6(CXXABI_1.3.9)(64bit) libstdc++.so.6(GLIBCXX_3.4)(64bit) libstdc++.so.6(GLIBCXX_3.4.11)(64bit) libstdc++.so.6(GLIBCXX_3.4.15)(64bit) libstdc++.so.6(GLIBCXX_3.4.18)(64bit) libstdc++.so.6(GLIBCXX_3.4.20)(64bit) libstdc++.so.6(GLIBCXX_3.4.21)(64bit) libstdc++.so.6(GLIBCXX_3.4.5)(64bit) libstdc++.so.6(GLIBCXX_3.4.9)(64bit) rtld(GNU_HASH)
Processing files: cutlass-devel-3.5.0-20240411.1.cu12_4.el8.x86_64
Provides: cmake(NvidiaCutlass) = 3.5.0 cmake(nvidiacutlass) = 3.5.0 cutlass-devel = 3.5.0-20240411.1.cu12_4.el8 cutlass-devel(x86-64) = 3.5.0-20240411.1.cu12_4.el8
Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1
Requires: cmake-filesystem(x86-64)
Processing files: cutlass-static-3.5.0-20240411.1.cu12_4.el8.x86_64
Provides: cutlass-static = 3.5.0-20240411.1.cu12_4.el8 cutlass-static(x86-64) = 3.5.0-20240411.1.cu12_4.el8
Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1
Checking for unpackaged file(s): /usr/lib/rpm/check-files /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64
Wrote: /builddir/build/RPMS/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64.rpm
Wrote: /builddir/build/RPMS/cutlass-devel-3.5.0-20240411.1.cu12_4.el8.x86_64.rpm
Wrote: /builddir/build/RPMS/cutlass-static-3.5.0-20240411.1.cu12_4.el8.x86_64.rpm
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.07yPHP
+ umask 022
+ cd /builddir/build/BUILD
+ cd cutlass
+ /usr/bin/rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.x86_64
+ exit 0
Finish: rpmbuild cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm
Finish: build phase for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm
INFO: chroot_scan: 3 files copied to /var/lib/copr-rpmbuild/results/chroot_scan
INFO: /var/lib/mock/rhel+epel-8-x86_64-1713469181.334935/root/var/log/dnf.rpm.log
/var/lib/mock/rhel+epel-8-x86_64-1713469181.334935/root/var/log/dnf.librepo.log
/var/lib/mock/rhel+epel-8-x86_64-1713469181.334935/root/var/log/dnf.log
INFO: Done(/var/lib/copr-rpmbuild/results/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm) Config(child) 444 minutes 10 seconds
INFO: Results and/or logs in: /var/lib/copr-rpmbuild/results
INFO: Cleaning up build root ('cleanup_on_success=True')
Start: clean chroot
INFO: unmounting tmpfs.
Finish: clean chroot
Finish: run
Running RPMResults tool
Package info:
{
    "packages": [
        {
            "name": "cutlass",
            "epoch": null,
            "version": "3.5.0",
            "release": "20240411.1.cu12_4.el8",
            "arch": "src"
        },
        {
            "name": "cutlass-static",
            "epoch": null,
            "version": "3.5.0",
            "release": "20240411.1.cu12_4.el8",
            "arch": "x86_64"
        },
        {
            "name": "cutlass",
            "epoch": null,
            "version": "3.5.0",
            "release": "20240411.1.cu12_4.el8",
            "arch": "x86_64"
        },
        {
            "name": "cutlass-devel",
            "epoch": null,
            "version": "3.5.0",
            "release": "20240411.1.cu12_4.el8",
            "arch": "x86_64"
        }
    ]
}
RPMResults finished