Comment by delusional

I used `perf` to get from "it's stuttering" to "it's spending a very long time in the GPU driver". Then GDB and printf debugging got me to "the sort in the driver is taking a long time because there is an excessive number of TTM buffer objects, not because we are calling it too often". I could have made that leap faster, and I will next time, but this time that step took me a couple of hours. From there the question was who is creating those buffer objects, so it was back to GDB, which turned up nothing in sway/wlroots.

That was where I sort of ran out of good ideas. I have never worked with Wayland before. I figured it's a "protocol", so it must have a way to inspect it, and it does. `WAYLAND_DEBUG=1` lets you dump the Wayland messages, which I then inspected manually to find a discrepancy between allocations and deallocations. That points to a bug in the client (Firefox in this case), so I looked through their issue tracker, where I found a somewhat similar bug[1]. I reported my findings there.
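The manual comparison of allocations and deallocations could be automated along these lines. This is only a sketch, not what was actually run: the function name `unreleased_buffers` is hypothetical, and the two line shapes in the comments are assumptions about the `WAYLAND_DEBUG=1` output, which differs between libwayland versions.

```python
import re

# Assumed line shapes (they vary between libwayland versions):
#   [ 123.456]  -> wl_shm_pool@8.create_buffer(new id wl_buffer@9, ...)
#   [ 124.001]  -> wl_buffer@9.destroy()
NEW_BUFFER = re.compile(r"new id wl_buffer@(\d+)")
DESTROY_BUFFER = re.compile(r"wl_buffer@(\d+)\.destroy\(")


def unreleased_buffers(lines):
    """Return (created, destroyed, live_ids) for wl_buffer objects in a log."""
    live = set()
    created = destroyed = 0
    for line in lines:
        match = NEW_BUFFER.search(line)
        if match:
            created += 1
            live.add(int(match.group(1)))
            continue
        match = DESTROY_BUFFER.search(line)
        if match:
            destroyed += 1
            # Object IDs can be reused after destruction, so liveness is
            # tracked per ID instead of just comparing the two totals.
            live.discard(int(match.group(1)))
    return created, destroyed, live
```

Assuming the client's debug output has been captured to a file (it is normally written to stderr), something like `unreleased_buffers(open("wayland.log"))` gives the totals. A `created` count that keeps pulling ahead of `destroyed` while the page is open would be consistent with the kind of leak described here.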

Since then I've checked out the Firefox code (which I've also never worked with before). After going back to GDB and the logs, I think I know what's going wrong, but you can read the Bugzilla entry for the details.

[1]: https://bugzilla.mozilla.org/show_bug.cgi?id=1999636

delusional a day ago

I have looked into it. This appears to be a Firefox bug that shows up when HDR is enabled on Wayland and the website is using WebGL. Firefox looks to be leaking `wl_buffer` objects, which causes a VRAM leak in the Wayland compositor, which in turn causes performance issues in the AMDGPU TTM buffer object management.


perching_aix 12 hours ago

Nice dig. Could you share more about how you narrowed it down in the end? Is it a known issue and you just had to confirm it applies, or did you identify all of this yourself?
