FP8 runs ~100 TFLOPS faster when the kernel name has "cutlass" in it

nulld3v an hour ago

https://github.com/triton-lang/triton/pull/7298#discussion_r...

> By disassembly of ptxas, it is indeed hard-coded that they have logic like: strstr(kernel_name, "cutlass").

> it is likely that, this is an unstable, experimental, aggressive optimization by NVIDIA, and blindly always enabling it may produce some elusive bugs.
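
For readers unfamiliar with the C call in the quote, here is a minimal sketch of what a name-based check of that shape could look like. Only `strstr(kernel_name, "cutlass")` comes from the quote. The function name `name_triggers_fast_path` and everything around the call are hypothetical, since the actual ptxas logic is not public.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the check described in the quote.
 * Only the strstr call itself is taken from the quote; the rest
 * is an illustrative assumption, not ptxas code. */
static bool name_triggers_fast_path(const char *kernel_name)
{
    assert(kernel_name != NULL);
    /* strstr returns a pointer to the first occurrence of the
     * substring, or NULL when the substring is absent. */
    return strstr(kernel_name, "cutlass") != NULL;
}
```

Because `strstr` is a plain case-sensitive substring search, a name like `my_cutlass_kernel` would match in this sketch while `CUTLASS_gemm` would not.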


temp0826 an hour ago

Thanks for the context. This is not my wheelhouse at all (I had never even heard of this project), and I could not make heads or tails of the title or the linked PR.


kilpikaarna 35 minutes ago

Keeping it real with the commit msgs


RestartKernel 19 minutes ago

I much prefer this over those AI-generated commit messages that just say "refactored X" every single commit.

- rkomorn 15 minutes ago
  
  So AI really does learn from humans...
  
IshKebab a few seconds ago

I think it's fine if you squash it. I have no idea why they didn't squash it before pushing to GitHub though.


vasco an hour ago

This was discussed before at the time the PR was created and there's nothing new that I can see.

https://news.ycombinator.com/item?id=44530581


rob_c 6 minutes ago

Thought this looked familiar...


globular-toast 14 minutes ago

Someone really needs to learn to use `git commit --amend`. Almost 100 commits with pointless commit messages like "wip" or "x"? Be kinder to your reviewers...


Zardoz84 4 minutes ago

And `git commit --fixup <commit>` followed by `git rebase -i --autosquash`.


fooker 43 minutes ago

When Intel did it, the pitchforks came out.

Nvidia seems to get a pass. Why's that?


kimixa 19 minutes ago

It really depends on the details.

If they are intentionally slowing non-CUTLASS shaders, sure, pitchfork time.

If it's an option that /technically/ breaks the CUDA shader compatibility contract, then enabling it only in specific "known good" situations is just business as usual for GPU drivers.

That can be for all kinds of reasons - straightforward bugs or incomplete paths in the optimization implementation, the app not actually needing the stricter parts of the contract so it can take a faster path, or even bugs in apps that need workarounds.

Piggybacking on these without understanding them can be extremely fragile, though. You don't know why they've limited it, and you run the risk of tripping over some situation that will simply fail, either with incorrect results or with something like a crash, possibly in rather unexpected, unpredictable situations.

m00x 27 minutes ago

This isn't the same thing

whatevaa 41 minutes ago

Intel did this to consumers; Nvidia does this to enterprises.
