thrtythreeforty 3 days ago

This ticket, finally closed after being open for 2 years, is a pretty good microcosm of this problem:

https://github.com/ROCm/ROCm/issues/1714

Users complaining that the docs don't even specify which cards work.

But it goes deeper: a valid complaint is that "this only supports one or two consumer cards!" A common rebuttal is that it works fine on lots of AMD cards if you set an environment flag to force the GPU architecture selection. The fact that this is so close to working on a wide variety of hardware, and yet doesn't, is exactly the vibe you get from the whole ecosystem.
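
For the curious, the flag people usually mean is HSA_OVERRIDE_GFX_VERSION, which makes ROCm treat your chip as a different (supported) architecture. A minimal sketch in C (compiled with hipcc), assuming an RDNA2 consumer card close enough to the officially supported gfx1030; the version string is an example, not a recommendation:

    /* The override must be set before the HIP runtime initializes,
       i.e. before the first hip* call in the process. */
    #include <hip/hip_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        setenv("HSA_OVERRIDE_GFX_VERSION", "10.3.0", 1);

        int count = 0;
        if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
            fprintf(stderr, "no usable ROCm device\n");
            return 1;
        }

        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, 0);
        printf("device 0: %s (arch %s)\n", prop.name, prop.gcnArchName);
        return 0;
    }

In practice most people just export the variable in their shell before launching whatever ROCm application they're using; the effect is the same.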

iforgotpassword 3 days ago

What I don't get is why they don't at least assign a dev or two to make the poster child of this work: llama.cpp

It's the first thing anyone tries when dabbling in AI or GPU compute, yet it's a clusterfuck to get working. A few blessed cards work, with the proper drivers and kernel; others just crash, perform horribly slowly, or output GGGGGGGGGGGGGG to every input (I'm not making this up!). Then you LOL, dump it, and go buy Nvidia, et voila: stuff works on the first try.

  • wkat4242 3 days ago

    It does work; I have it running on my Radeon VII Pro.

    • Filligree 2 days ago

      It sometimes works.

      • wkat4242 2 days ago

        How so? It's rock solid for me. I use Ollama, but it's based on llama.cpp.

        It's also quite fast, probably because that card has fast HBM2 memory (it has the same memory bandwidth as a 4090). And it was really cheap, since it was deeply discounted as an outgoing model.
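
        (Back-of-the-envelope check, with specs from memory so treat as approximate: the Radeon VII has a 4096-bit HBM2 bus at 2.0 Gbit/s per pin, so 4096 × 2.0 / 8 ≈ 1024 GB/s; an RTX 4090 has a 384-bit GDDR6X bus at 21 Gbit/s per pin, so 384 × 21 / 8 = 1008 GB/s. "Same bandwidth" checks out.)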

mook 3 days ago

I suspect part of it is also that Nvidia actually does a lot of things in firmware that can be upgraded. The new Nvidia Linux drivers (the "open" ones) support Turing cards from 2018. That means chips that old already do much of the processing in firmware.

AMD keeps having issues because their drivers talk to the hardware directly, so the drivers are massive, bloated messes, famous for pages of auto-generated register definitions. That likely makes anything much more difficult to fix.

  • Evil_Saint 3 days ago

    Having worked at both Nvidia and AMD I can assure you that they both feature lots of generated header files.

  • bgnn 3 days ago

    Hmm, that is interesting. Can you elaborate on what exactly is different between them?

    I'm asking because I think firmware has to talk directly to hardware through the lower HAL (hardware abstraction layer), while the customer-facing parts should be fairly isolated in the upper HAL. Some companies like to add direct HW access to the customer interface via more complex functions (often a recipe made out of lower-HAL functions), which I've always disliked. I prefer to isolate the lower-level functions and memory space from the user.
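
    A minimal sketch of the layering I mean, in C (all names hypothetical):

        #include <stdint.h>

        /* Lower HAL: raw register access. Never exposed to the customer. */
        static inline void hal_write_reg(volatile uint32_t *base,
                                         uint32_t offset, uint32_t value) {
            base[offset / sizeof(uint32_t)] = value;
        }

        /* Upper HAL: a "recipe" built from lower-HAL calls. This is the
           only thing the customer-facing API ever sees; the register map
           stays hidden behind it. */
        void engine_reset(volatile uint32_t *mmio_base) {
            hal_write_reg(mmio_base, 0x10 /* CTRL */, 0x1); /* assert reset  */
            hal_write_reg(mmio_base, 0x10 /* CTRL */, 0x0); /* release reset */
        }

        /* Demo against a fake register file so this compiles standalone. */
        static uint32_t fake_regs[64];
        int main(void) { engine_reset(fake_regs); return 0; }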

    In any case, both Nvidia and AMD should have very similar FW capabilities. I don't know what I'm missing here.

    • Evil_Saint 3 days ago

      I worked on drivers at both companies. The programming models are quite different. Both make GPUs, but they were designed by different groups of people who made different decisions. For example:

      Nvidia cards are much easier to program in the user-mode driver. You cannot hang an Nvidia GPU with a bad memory access. You can hang the display engine with one, though. At least when I was there.

      You can hang an AMD GPU with a bad memory access. At least up to Navi 3x.
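
      To make that concrete: the kind of user-mode access in question is just a wild pointer in a kernel. Illustration only, in HIP C (don't run this on a machine you care about; the point above is that on some AMD parts it can wedge the GPU instead of returning an error):

          /* Compile with hipcc. */
          #include <hip/hip_runtime.h>
          #include <stdio.h>

          __global__ void wild_store(int *p) {
              p[0] = 42;  /* p is a deliberately invalid device pointer */
          }

          int main(void) {
              int *bogus = (int *)0xdeadbeef0000ull;
              hipLaunchKernelGGL(wild_store, dim3(1), dim3(1), 0, 0, bogus);
              /* A well-behaved stack reports an error here; the claim is
                 that some hardware hangs instead. */
              hipError_t err = hipDeviceSynchronize();
              printf("sync returned: %s\n", hipGetErrorString(err));
              return 0;
          }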

    • raxxorraxor 3 days ago

      Why isolate these functions? That always cripples capabilities. With well-designed interfaces it doesn't lead to a mess, and you get a more powerful device. Of course these lower-level functions shouldn't be essential, but especially in these times you almost have to provide an interface here or be left behind by other environments.

citizenpaul 3 days ago

I've thought about this myself and come to a conclusion that your link reinforces. As I understand it, most companies doing (EE) hardware design and production treat (CS) software as a second-class citizen at the company. It looks like AMD, after all this time competing with NVIDIA, has not learned the lesson. That said, I have never worked in hardware, so I'm going by what I've heard from other people.

NVIDIA, while far from perfect, has kept its software quality comfortably ahead of AMD's for over 20 years, while AMD keeps falling on its face and getting egg all over itself, again and again, as far as software goes.

My guess is NVIDIA internally has found a way to keep the software people from feeling like they are "less than" the people designing the hardware.

Sounds easy, but apparently not. AKA a management problem.

  • bgnn 3 days ago

    This is correct, but one of the reasons is that the SWEs at HW companies live in their own bubble. They somehow don't follow the rest of the SW world's developments.

    I'm a chip design engineer, and I get frustrated with the garbage the SW/FW team comes up with, to the extent that I write my own FW library for my blocks. While doing that I try to learn the best practices and do quite a bit of research.

    One other reason is that until not long ago, SW meant only FW, which existed to serve the HW. So there was almost no input from SW into HW development. This is clearly changing, but some companies, like Nvidia, are ahead of the pack. Even Apple's SoC team is quite HW-centric compared to Nvidia.

Covzire 3 days ago

That reeks of gross incompetence somewhere in the organization. Like a hosting company whose customer, dealing with very poor performance, grossly overpays to work around it, while the whole time nobody even thinks to check what the Linux swap file is doing.

CoastalCoder 3 days ago

I had a similar (I think) experience when building LLVM from source a few years ago.

I kept running into some problem with LLVM's support for HIP code, even though I had no interest in that functionality.

I realize this isn't exactly an AMD problem, but IIRC they were the ones who contributed the troublesome code to LLVM, and it remained unfixed.

Apologies if there's something unfair or uninformed in what I wrote, it's been a while.

tomrod 3 days ago

Geez. If I were Berkshire Hathaway looking to invest in the GPU market, this would be a major red flag in my fundamentals analysis.