Mesa 24.3.1 Update Causing Segmentation Faults On Several-Vendors GPU Architectures #53434

TeusLollo · 2024-12-09T18:25:18Z

Is this a new report?

Yes

System Info

Void 6.6.63_1 x86_64 GenuineIntel uptodate rFFFF

Package(s) Affected

corectrl-1.4.1_1

Does a report exist for this bug with the project's home (upstream) and/or another distro?

None Found.

Expected behaviour

No segmentation faults given a compatible ABI interface.

Actual behaviour

Since Mesa update 027f896 I'm getting segmentation faults with Corectrl, and I noticed that configure_args = "-Ddri3=enabled" was removed for no apparent reason that I could find (at least among discussions by Void devs on GitHub; I can't know what Mesa devs may or may not have done). This could affect other applications if it results in ABI changes, and the missing argument may itself be unintentional.

EDIT: See the comments and upstream links below; in short, unintentional ABI changes are causing segfaults on multiple architectures:

https://gitlab.freedesktop.org/mesa/mesa/-/issues/12253 (Segfault on AMD Polaris)
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12275 (Segfault on Nvidia Quadro)
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12283 (Segfault with Gnome-Shell on Intel Integrated Graphics)

Steps to reproduce

Update to Mesa 24.3.1 027f896

Run Corectrl in a terminal

Amid other generic Qt5 errors, notice the segmentation fault at the end:

[09-12-24 19:24:34.442][I] No translation found for locale en_US
[09-12-24 19:24:34.442][I] Using en_EN translation.
QSystemTrayIcon::setVisible: No Icon set
qt.qpa.wayland: Wayland does not support QWindow::requestActivate()
zsh: segmentation fault corectrl
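
For anyone who wants to confirm the crash from a script rather than by reading the terminal output, here is a minimal sketch. It relies on bash, dash and zsh reporting a child killed by signal N as exit status 128 + N, so SIGSEGV (signal 11) shows up as 139. The function names are made up for illustration and are not part of Corectrl or Void.

```shell
#!/bin/sh
# Exit status 139 = 128 + 11 (SIGSEGV): how bash, dash and zsh report
# a child process that was killed by a segmentation fault.
is_segv_status() {
    [ "$1" -eq 139 ]
}

# Run a command and report whether it died from SIGSEGV.
# Returns 0 (true) only when the command segfaulted.
# (Hypothetical helper name, for illustration only.)
crashed_with_segv() {
    "$@"
    status=$?
    if is_segv_status "$status"; then
        echo "segfault: '$*' exited with status $status" >&2
        return 0
    fi
    return 1
}

# Example (not run here):
#   crashed_with_segv corectrl && echo "reproduced"
```

This only tells you that the process died from SIGSEGV, not where; a backtrace from a debugger would still be needed for an upstream report.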


classabbyamp · 2024-12-09T18:30:37Z

@SpidFightFR

classabbyamp · 2024-12-09T18:32:40Z

fwiw, that removal is not the reason things are segfaulting, as the changelog indicates that flag was removed:

meson: delete dri3 build option

SpidFightFR · 2024-12-09T18:34:14Z

On 24.3.0, this argument was removed from the build options.

Either they re-added it in 24.3.1 (which i'll recheck) or it got replaced by another opt, or it is a bug within mesa itself.

TeusLollo · 2024-12-09T18:34:33Z

I of course meant "No reasons that I could find among Void devs". Will be changing spelling soon.

classabbyamp · 2024-12-09T18:35:40Z

in fact, dri3 is now always enabled: https://gitlab.freedesktop.org/mesa/mesa/-/commit/8f6fca89aa1812b03da6d9f7fac3966955abc41e

TeusLollo · 2024-12-09T18:37:15Z

Could be a bug in Mesa then. I am absolutely sure that only Mesa was updated, and I started my Void box only a few minutes ago after keeping it off for 20+ hours (no segmentation faults before, and no relevant updates to Corectrl in weeks).
Other apps/binaries may be affected in any case if the ABI changed, intentionally or not.

SpidFightFR · 2024-12-09T18:38:06Z

no worries, though i shared the same thoughts as you when i made the original PR for 24.3.0.

SpidFightFR · 2024-12-09T18:39:39Z

I'll keep an eye out on the different issues tab and stuff. just in case.

TeusLollo · 2024-12-09T18:39:41Z

I just typed the issue very fast cause I'm in a hurry, and It may have sounded wrong. Thanks for your understanding.

If we find other apps/binaries segfaulting, it may be prudent to revert Mesa, though I know that can be a pain (well, it's also a pain to downgrade so many packages; I'm getting a list right now).

SpidFightFR · 2024-12-09T18:41:42Z

that is okay really, the faster we identify problems, the better.
I do prefer the way you opened the issue, even if it may be a false positive, rather than letting an important bug pass through in production.

SpidFightFR · 2024-12-09T18:42:28Z

I'll update it on my production machine. and check what may happen.

TeusLollo · 2024-12-09T18:57:42Z

STATUS UPDATE:

Just ran the following (Github is ignoring formatting)

sudo xdowngrade \
  /var/cache/xbps/mesa-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-dri-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-dri-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-libgallium-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-libgallium-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-vulkan-overlay-layer-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-vulkan-overlay-layer-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-vulkan-radeon-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-vulkan-radeon-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libglapi-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libglapi-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libgbm-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libgbm-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libOSMesa-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libOSMesa-32bit-24.2.7_1.x86_64.xbps

And...Corectrl was correctly launched with no segmentation fault errors. Running in userspace tray right now with no apparent problems, GPU fans were also correctly manipulated by the application.

I'm guessing something happened with Mesa's ABI.

If we can't find other apps affected, I'll open an issue on Corectrl's repo.

EnumuratedDev · 2024-12-09T19:09:51Z

I've had multiple programs affected by this issue: Firefox, alacritty, discord (through flatpak) and more. Downgrading fixed all issues. I believe this is related to this mesa issue https://gitlab.freedesktop.org/mesa/mesa/-/issues/12253

TeusLollo · 2024-12-09T19:19:47Z

Seems only to affect AMD Polaris-based GPUs, which is indeed what I am using (Hey, don't judge me, those monster GPUs they make today don't fit in my pc case).

SpidFightFR · 2024-12-09T19:58:46Z

Indeed, I didn't notice it in my testing; I guess RDNA3 GPUs aren't affected somehow... My apologies for that.

TeusLollo · 2024-12-09T21:00:58Z

It's fine. they're adding OpenGL 4.6 API, so they're bound to have lots of regressions before they stabilize, and we can't ask maintainers to do all the tests that mesa-dri devs should be doing in the first place.

sofijacom · 2024-12-10T05:56:48Z

AMD Radeon

After updating Mesa 24.3.1 the system stopped loading, the screen is black, it no longer responds to anything.

biopsin · 2024-12-10T08:29:11Z

I have one AMD pc with Polaris 20 RX570 which I will manage to test later today..

risusinf · 2024-12-10T10:47:58Z

AMD Radeon

After updating Mesa 24.3.1 the system stopped loading, the screen is black, it no longer responds to anything.

Same with A8-9600 integrated graphics. Fixed by downgrading.

sudo xdowngrade /var/cache/xbps/mesa-24.2.7_1.x86_64.xbps /var/cache/xbps/mesa-dri-24.2.7_1.x86_64.xbps /var/cache/xbps/libglapi-24.2.7_1.x86_64.xbps /var/cache/xbps/mesa-libgallium-24.2.7_1.x86_64.xbps /var/cache/xbps/libgbm-24.2.7_1.x86_64.xbps

SpidFightFR · 2024-12-10T10:54:10Z

@classabbyamp maybe it's best to revert my changes, wait for things to calm down?

classabbyamp · 2024-12-10T12:49:48Z

or maybe the patch from upstream could be tested

either way, make a pr please

naneros · 2024-12-10T12:57:11Z

AMD Radeon RX 580
Firefox segmentation fault, supertuxkart black screen.
Downgrading to 24.2.7 fixes the problems.

SpidFightFR · 2024-12-10T13:49:25Z

or maybe the patch from upstream could be tested

either way, make a pr please

I'm at work right now, could you please send the link to the patch so that i can check it as soon as possible to make the pr please?

TeusLollo · 2024-12-10T15:43:21Z

Further segfaults issues are flooding the issues tab, and don't seem related to this specific bug:

https://gitlab.freedesktop.org/mesa/mesa/-/issues/12275 (Segfault on Nvidia Quadro)
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12283 (Segfault with Gnome-Shell on Intel Integrated Graphics)

With this many segfaults on all available GPU vendors, I would begin to doubt this is a problem limited to Polaris architecture (Though it's probably a multi-bug release due to fundamental ABI changes).
I would say it may be safer to just revert for now and wait for Mesa to let their next update cook a little longer. As I said before, they're working on the OpenGL 4.6 API, and thus they're bound to have lots of regressions before stabilizing. Besides, I don't think anyone specifically requesting those OpenGL 4.6 API changes expects them to be finalized already.

TeusLollo · 2024-12-10T17:20:44Z

OK, those issues seem confirmed, and we're already seeing developers commenting, plus horror screenshots. There are definitely ABI changes that broke support on multiple architectures; they don't seem intentional, and Mesa developers have not mentioned dropping support for any architecture. I changed the title accordingly.

I will keep the issue open even after a revert pull request (which is looking more likely now, unless maintainers want to pull at least 3 fixes from upstream), since we'll probably need extensive testing of whatever Mesa 24.3.2+ package is made once Mesa devs have cooked up some fixes. I doubt they'll stay long on Mesa 24.3.1 with all those bugs cropping up.

I ask maintainers to link this issue to the pull request meant to revert/fix those bugs for easier tracking, if possible.

Anyone else is free to link this issue to Mesa gitlab devs in case they need more info/testing.

biopsin · 2024-12-10T18:00:11Z

Unfortunately, I can't trigger a segfault on my AMD PC with the Polaris 20 RX570 in general testing: Firefox and WebGL, a game...
I'm fine with the downgrade until this pans out; they seem to have their hands full for the time being.

zlice · 2024-12-11T01:00:31Z

drifix.patch.txt

Does this fix anything for anyone? Sure mesa crew would like to know if so

Ser also posted this patch with a PR near the bottom of that first issue 12253

drifix-simon.patch.txt

hvraven · 2024-12-11T10:18:29Z

Hit the same bug with a RX480. Added the patch (downloaded directly from gitlab) to the mesa 24.3.1 pkg and can confirm it fixes the issue.

TeusLollo · 2024-12-11T20:18:27Z

Very well, but it would be prudent, as mentioned here #53470 (comment), to just wait for a potential .2 future version, given that multiple architectures across multiple vendors are affected.

I won't be closing the issue for now (If that's okay with maintainers, of course), since we've hit a good amount of potential testers for when a .2 future version hits, whenever it happens, to which we'll have to eventually update anyway.

@SpidFightFR Thanks for the downgrade (Tested on 2 Void boxes, it just werks).
And I noticed you were apologizing at #53453 (comment). I just wanted to tell you that your work has been excellent so far, and you shouldn't feel the need to apologize on this particular occasion. You did nothing wrong; it's just that Mesa devs released an update that probably should have baked a little longer. You did well, regardless.
As always, I would like to thank you, the rest of the Void Linux team, and everyone else from the userbase for their efforts and continued support.

SpidFightFR · 2024-12-11T20:39:31Z

@TeusLollo Hey, thanks for your message, it truly means a lot.
Although, I'm not officially a maintainer for mesa, i do try my best to keep it updated within void repos, as much as possible. Because it means a lot in my personal usage.

I'll keep an eye out for the next minor release of mesa's 24.3 branch.
In the meantime, there should be one or so new releases into 24.2 branch, so i'll make sure to take these into account.

classabbyamp · 2024-12-23T21:30:12Z

users affected by this, please test this: #53601

narodnik · 2024-12-27T13:33:39Z

Hey I'm not sure if related or not, but there has been a serious bug with amdgpu and mesa. See the issues here: https://gitlab.freedesktop.org/drm/amd/-/issues

Multiple reports of people's machines just freezing. In my case, the wayland session is usable for ~20 mins then everything locks up. I just use the TTY now on that machine.

I tried using older kernels: 6.6, 6.10, 6.11. I tried downgrading amdgpu from 20240909_1 to 20241110_1 and 20240909_1. I tried switching to onboard integrated graphics (also AMD) and not using the graphics card.
Still the issue persists. The computer is near unusable. Not sure what else to try.
I'm using Wayland.

VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 32 [Radeon RX 7700 XT / 7800 XT] (rev c8)
VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev c1)

So I suspect the problem is with the recent mesa upgrade 24.2.8_2. I tried downgrading my mesa but there are so many confusing dependencies that it's very difficult.

TeusLollo · 2024-12-27T20:04:24Z

Do this:

sudo xbps-install -Syu xtools
ls /var/cache/xbps/ | grep "mesa"
sudo xdowngrade /var/cache/xbps/mesa-(whatever1) /var/cache/xbps/mesa-(whatever2) /var/cache/xbps/mesa-(whatever3)...

Basically, once you have ensured xtools is on your machine (you need it for xdowngrade), use ls to list all the mesa packages currently in /var/cache/xbps, then pass them to xdowngrade to downgrade every one of them. Remember that xbps does not provide online downgrades unless upstream does it; the obsolete packages must physically be in your cache to be downgraded. There are safety measures in place: if you attempt to downgrade only some of them, xdowngrade will throw an error and tell you what's missing, so it's relatively safe.

Obviously, if you already cleared your cache, you won't have any obsolete packages in your cache, and thus this method will be unavailable.
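
To avoid typing every path by hand, the cached files for one exact version can be collected with a glob. This is only a sketch: the function name is hypothetical, it assumes the cache file naming seen in this thread (name-version.arch.xbps), and it prints candidates rather than running anything. Review the list first and drop any package you do not actually have installed.

```shell
#!/bin/sh
# List cached .xbps files for one exact version, one path per line.
# Arguments: version (e.g. 24.2.7_1) and cache directory (defaults to
# the path used in this thread). Hypothetical helper, not part of xtools.
cached_pkgs_for_version() {
    version=$1
    cache_dir=${2:-/var/cache/xbps}
    for f in "$cache_dir"/*-"$version".*.xbps; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Example (not run here): inspect the list, then pass it to xdowngrade
# in a single invocation so dependencies resolve together.
#   cached_pkgs_for_version 24.2.7_1
#   sudo xdowngrade $(cached_pkgs_for_version 24.2.7_1)
```

The glob matches on the version string only, so any unrelated package that happens to share the same version and revision would be listed too; that is another reason to check the output before downgrading.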

The manual entry here: https://docs.voidlinux.org/xbps/advanced-usage.html

It may not be related, though. AMD has a history of hard freezes on multiple occasions, although you never know whether a segmentation fault in a Wayland-based WM could actually trigger a display freeze.

As for me, I may be able to test this in a few days, because, you know, end of the year with family and stuff, and it's going to take a while to compile, install, and perform broad tests, assuming it all goes well.

narodnik · 2024-12-28T13:49:56Z

I try to downgrade mesa, but it says I need to downgrade libglapi. I try to downgrade libglapi, but it says I need to downgrade mesa.

~#  xdowngrade /var/cache/xbps/mesa-24.2.7_1.x86_64.xbps 
index: added `mesa-24.2.7_1' (x86_64).
index: 1 packages registered.
MISSING: libglapi-24.2.7_1
Transaction aborted due to unresolved dependencies.
~# xdowngrade /var/cache/xbps/libglapi-24.2.7_1.x86_64.xbps 
index: added `libglapi-24.2.7_1' (x86_64).
index: 1 packages registered.
libglapi-24.2.7_1 in transaction breaks installed pkg `libOSMesa-24.2.8_2'
libglapi-24.2.7_1 in transaction breaks installed pkg `mesa-24.2.8_2'
libglapi-24.2.7_1 in transaction breaks installed pkg `mesa-libgallium-24.2.8_2'
Transaction aborted due to unresolved dependencies.

TBH I think void should just downgrade mesa/amdgpu. There's something very clearly massively broken in this latest release. People are reporting their systems just freezing and requiring a hard reboot.

https://gitlab.freedesktop.org/drm/amd/-/issues

classabbyamp · 2024-12-28T13:55:37Z

do all the downgrades in 1 command.

TBH I think void should just downgrade mesa/amdgpu. There's something very clearly massively broken in this latest release. People are reporting their systems just freezing and requiring a hard reboot.

we are not using the latest release. we already downgraded back to 24.2.x from that. which version are people reporting causes freezes?

TeusLollo · 2024-12-28T14:27:56Z

I try to downgrade mesa, but it says I need to downgrade libglapi. I try to downgrade libglapi, but it says I need to downgrade mesa.

You can add multiple arguments in succession. Thus, xdowngrade /var/cache/xbps/mesa-24.2.7_1.x86_64.xbps /var/cache/xbps/libglapi-24.2.7_1.x86_64.xbps (All in the same line at the same time) and so on.
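There is no practical limit to the number of arguments you can pass to a command (the only ceiling is the system's maximum argument-list size, which a handful of package paths will not approach).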
That should fix it, although you will probably have to downgrade more packages than mesa & libglapi, but it should be easy enough (xbps will tell you).
This downgrade failsafe is in place because mesa packages of differing version may not play nicely together, although results may vary.

TBH I think void should just downgrade mesa/amdgpu. There's something very clearly massively broken in this latest release. People are reporting their systems just freezing and requiring a hard reboot.

https://gitlab.freedesktop.org/drm/amd/-/issues

We have, in fact, already downgraded. We were on mesa-24.3.* and that's when I and other users noticed the great segfault cascade. If you can confirm, after downgrading, that the issue is gone, and can manage to get the output of whatever Wayland-based window manager you are using (it depends on the window manager, each one is a bit different; use man to access its built-in manual), you will be able to see for yourself whether it's a segfault or something else entirely.

BTW, keep in mind that AMD-based GPUs have a history of hard freezes that is not related to mesa. It comes and goes with each release and major update of the kernel-level AMD driver (which is not handled by mesa developers), and it may be related to linux-firmware-amd (which is closed-source) not playing very nicely with such driver updates. A good approach is to use a tool like corectrl to lock GPU performance to a fixed level (low or high performance modes, or other approaches where you select a given GPU frequency and block it from downscaling or upscaling), which on my systems has avoided those freezes entirely.

https://gitlab.freedesktop.org/drm/amd/-/issues/960
https://bugs.archlinux.org/task/68396
https://bugs.archlinux.org/task/68424
https://bugs.archlinux.org/task/68402
https://gitlab.freedesktop.org/drm/amd/-/issues/716

Also, ensure that linux-firmware-amd is up-to-date

narodnik · 2024-12-28T19:50:23Z

Thank you so much, that's very helpful.

I had these lock up issues a year ago, but then they went away.

When they were happening, they would happen maybe every 8+ hours. This recurrence is now quicker, like 30 mins or so.

Unfortunately I tried downgrading but it didn't go away. The system still freezes. I tried these combos:

Linux kernel 6.10:

~# xdowngrade /var/cache/xbps/linux-firmware-amd-20241110_1.x86_64.xbps
~# xdowngrade mesa-24.2.7_1.x86_64.xbps libglapi-24.2.7_1.x86_64.xbps libOSMesa-24.2.7_1.x86_64.xbps mesa-libgallium-24.2.7_1.x86_64.xbps libgbm-24.2.7_1.x86_64.xbps libgbm-devel-24.2.7_1.x86_64.xbps

Kernel 6.6:

~# xdowngrade linux-firmware-amd-20240909_1.x86_64.xbps
~# xdowngrade mesa-24.2.6_1.x86_64.xbps libglapi-24.2.6_1.x86_64.xbps libOSMesa-24.2.6_1.x86_64.xbps mesa-libgallium-24.2.6_1.x86_64.xbps libgbm-24.2.6_1.x86_64.xbps libgbm-devel-24.2.6_1.x86_64.xbps MesaLib-devel-24.2.6_1.x86_64.xbps

Is there anything I'm missing? When I downgrade firmware, do I need to run anything else? Could there be another component causing the issue?

Based off your advice, I tried underclocking the GPU but it didn't work either.

~# cd /sys/class/drm/card0/device
/sys/class/drm/card0/device# echo low > power_dpm_force_performance_level 
/sys/class/drm/card0/device# echo balanced > power_dpm_state 
/sys/class/drm/card0/device# cd ../../card1/device
/sys/class/drm/card1/device# echo low > power_dpm_force_performance_level 
/sys/class/drm/card1/device# echo balanced > power_dpm_state

https://wiki.gentoo.org/wiki/AMDGPU#Frequent_and_Sporadic_Crashes

Still got the crash though.

TeusLollo · 2024-12-29T01:24:06Z

Thank you so much, that's very helpful.

I had these lock up issues a year ago, but then they went away.

When they were happening, they would happen maybe every 8+ hours. This reoccurrence is now more quicker like 30 mins or so.

Unfortunately I tried downgrading but it didn't go away. The system still freezes. I tried these combos:

Every time the AMDGPU kernel driver is worked upon (Usually to add support for newer GPUs), something like this happens. It comes and goes, and it's been like this for years. Remember that the AMDGPU driver is shared by many GPU adapters (Like, tens of those), thus one bug will affect multi-generation adapters.

~# cd /sys/class/drm/card0/device
/sys/class/drm/card0/device# echo low > power_dpm_force_performance_level
/sys/class/drm/card0/device# echo balanced > power_dpm_state
/sys/class/drm/card0/device# cd ../../card1/device
/sys/class/drm/card1/device# echo low > power_dpm_force_performance_level
/sys/class/drm/card1/device# echo balanced > power_dpm_state
https://wiki.gentoo.org/wiki/AMDGPU#Frequent_and_Sporadic_Crashes

Still got the crash though.

I did not say to attempt any undervolting (which should not be attempted unless you really know what you're doing). Also, the commands you're listing here are not about undervolting, but about indexing available GPU power states (again, avoid attempting underclocking).
Remember also that kernel parameters are needed at boot for (Most of) these to work, otherwise the GPU will just ignore those, but won't output that they're being ignored, and you'll be thinking it's working while it's not.
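
Since a write can be silently ignored, it may help to read the values back after setting them. A minimal sketch, assuming the sysfs layout from the commands quoted above (cardN/device/power_dpm_force_performance_level); the function name is hypothetical:

```shell
#!/bin/sh
# Print "<card>: <level>" for every card that exposes the amdgpu
# power_dpm_force_performance_level file. The first argument is the DRM
# class directory; it defaults to the path used in this thread.
# (Hypothetical helper name, for illustration only.)
show_dpm_levels() {
    drm_root=${1:-/sys/class/drm}
    for f in "$drm_root"/card*/device/power_dpm_force_performance_level; do
        [ -r "$f" ] || continue
        card=${f#"$drm_root"/}      # strip the leading directory
        card=${card%%/*}            # keep only "cardN"
        printf '%s: %s\n' "$card" "$(cat "$f")"
    done
}

# Example (not run here):
#   show_dpm_levels
```

If the value printed is not the one you just wrote, the setting did not take effect.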

No, I was writing about doing this:

https://linuxreviews.org/HOWTO_undervolt_the_AMD_RX_4XX_and_RX_5XX_GPUs

But not the undervolting part, the "The Quick And Easy Way To Manually "Undervolt" AMD GPUs" (Which, again, is not "undervolting", but fixing GPU clock states, that's why they put it in the "").
Basically, look at the section for "HOWTO Limit The GPU To A Certain Set Of GPU Clock States" (But remember that you'll need to inject kernel parameters for this to work).
Then, limit the GPU clock state to ONE specific clock state (It's all in the guide). It's critical for it to be ONE clock state, for whatever reason, that black screen is triggered by the GPU switching clock states on the fly (But, if we force the GPU to stay on ONE clock state, there won't be any switching available, thus hopefully no black screen triggering).

Again, much easier to just get corectrl and do it from there. It'll also let you see quite clearly if you've injected kernel parameters correctly (Performance Mode: Advanced, Only have ONE GPU State checked, others disabled).

Monitor temperatures, though, you don't want the thing to catch on fire.

narodnik · 2024-12-29T09:21:01Z

Thanks so much. This is hugely helpful.

I followed your advice, and rebooted kernel 6.6 with the param amdgpu.ppfeaturemask=0xffffffff.
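
A quick way to double-check that the parameter really reached the running kernel is to look for it on the boot command line. This is a sketch only; the function name is made up, and it assumes the parameter was passed on the kernel command line (visible in /proc/cmdline) rather than through a modprobe configuration file.

```shell
#!/bin/sh
# Return 0 if the kernel command line contains the exact parameter.
# Arguments: parameter (e.g. amdgpu.ppfeaturemask=0xffffffff) and,
# optionally, the command-line text (defaults to /proc/cmdline).
# (Hypothetical helper name, for illustration only.)
cmdline_has_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    # Pad with spaces so only whole, space-separated words match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not run here):
#   cmdline_has_param amdgpu.ppfeaturemask=0xffffffff && echo "param is set"
```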

Then I open corectrl. My external GPU is card 0, and internal is card 1. For card 0, I see the GPU frequency constantly changing, but it remains fixed at 600 Mhz for card 1.

I also tried setting it through the sysfs API:

/sys/class/drm/card0/device# cat pp_dpm_sclk 
0: 500Mhz
1: 17Mhz *
2: 2254Mhz
/sys/class/drm/card0/device# echo 0 > pp_dpm_sclk 
/sys/class/drm/card0/device# cat pp_dpm_sclk 
0: 500Mhz
1: 7Mhz *
2: 2254Mhz

/sys/class/drm/card1/device# cat pp_dpm_sclk 
0: 400Mhz 
1: 600Mhz *
2: 2200Mhz 
/sys/class/drm/card1/device# echo 0 > pp_dpm_sclk 
bash: echo: write error: Invalid argument

Even with this, my system crashed.

Interestingly it only seems to happen when switching windows. If I use a single terminal window, the crash doesn't happen.
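
In the pp_dpm_sclk listings above, the asterisk marks the currently active clock state. A small sketch for pulling out that index, so it can be checked after each change (the function name is hypothetical). As far as the kernel's amdgpu documentation describes it, manually selecting states through pp_dpm_sclk first requires writing manual to power_dpm_force_performance_level, which may be relevant to the ignored write and the "Invalid argument" error shown here.

```shell
#!/bin/sh
# Read pp_dpm_sclk-style text on stdin and print the index of the state
# marked active with a trailing "*".
# (Hypothetical helper name, for illustration only.)
active_sclk_index() {
    awk '$NF == "*" { sub(/:$/, "", $1); print $1 }'
}

# Example (not run here):
#   active_sclk_index < /sys/class/drm/card0/device/pp_dpm_sclk
```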

TeusLollo · 2024-12-29T14:14:04Z

A few more things:

Remember that with corectrl you need to press the "Apply" button on top-right of the window for settings to be injected, and "Save" (Only appears after having pressed "Apply") for those to be remembered. Double check you did that just in case.
With hardware this powerful, you probably don't need two GPUs working in tandem (Which they don't really work in tandem, most of the work is offloaded to the PCIe-located GPU regardless). You can disable your CPU-embedded GPU in BIOS/UEFI. It'll save you on CPU heat, power consumption, and lots of headaches in the future when configuration gets confused because there are two display adapters. Even on a dual or triple setup monitor, there really is not a need to keep the CPU-embedded GPU activated. You may want to retry and see if a 1-GPU setup does away with crashes.
You shouldn't downgrade the *-firmware packages. They're just collections of binaries provided by manufacturers. Normally, newer versions just increase the number of binaries contained, but some ancient binaries may be dropped because of security concerns. Since you're using fairly recent hardware, you're better off staying on the latest firmware packages.
You did not mention what window manager (WM) you are using. If you're on one of the typical desktop environments (DEs) (Like Gnome, KDE, XFCE and the like), each of those comes with its own WM. You should take a look at their documentation and see where the WM locates .log files. If you're using a custom WM, you may want to check the documentation of that WM to see where it stores .log files. Once you've found .log files, run a search into those, and see if you can find anything with "segfault" or "seg" or "segmentation". Remember for the search to be case-insensitive and NOT to look for whole words only.
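
The log search described in the last point can be sketched with grep. The function name is hypothetical and the log location depends entirely on your WM, so the path in the example is a placeholder. GNU grep's -r flag is assumed.

```shell
#!/bin/sh
# Search log files or directories recursively for segfault mentions.
# -i: case-insensitive, -n: show line numbers, -E: extended regex.
# No -w flag, so partial-word hits such as "segfaulted" still match.
# (Hypothetical helper name, for illustration only.)
find_segfaults() {
    grep -rniE 'segfault|segmentation' "$@"
}

# Example (not run here; replace the path with your WM's log location):
#   find_segfaults /path/to/wm/logs
```

The function returns a non-zero status when nothing matches, which is the "if not" case in the next paragraph. Widening the pattern to just 'seg' will catch more variants at the cost of more noise.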

If not, then you're experiencing a different problem from the one we've been identifying here. If so, you may want to open a separate issue here on void-packages (where a package downgrade can be requested, assuming many are experiencing the same problem, or a patch commit if developers appear to have produced a hotfix somewhere).
You may otherwise open an issue on the repository of the WM developers, or the repository of the mesa developers (on GitLab) if you're absolutely sure this happened when mesa was updated. They'll take it from there, since what's done here is mostly software packaging and distribution, not development of each of those software suites.

narodnik · 2024-12-29T19:08:07Z

Thanks so much. You've really been very generous with your time. I've followed all your advice above. Disabling the internal GPU is a good idea. My WM is wlroots based DWL. I checked the output, and nothing seems to appear in dmesg nor my WM's output. But just looking at the amdgpu issue tracker, there's a whole load of new reports about crashing cards so I think this is not just isolated to me. But you're right, and you've given me some powerful leads to chase up. Thanks again.

Indeed seems a big issue: https://gitlab.freedesktop.org/drm/amd/-/issues/3092

What's strange is everything was fine until just a week ago when I updated. Now even downgrading doesn't fix it.

narodnik · 2025-01-01T14:42:32Z

hey just commenting that I managed to get a stable configuration, and have opened a new issue here: #53787

TeusLollo added the bug and needs-testing labels Dec 9, 2024

TeusLollo changed the title ~~Removal of -Ddri3=enabled from Mesa 24.3.1 Update May Be Causing Segmentation Faults In Some Applications~~ Mesa 24.3.1 Update May Be Causing Segmentation Faults In Some Applications Dec 9, 2024

TeusLollo changed the title ~~Mesa 24.3.1 Update May Be Causing Segmentation Faults In Some Applications~~ Mesa 24.3.1 Update May Be Causing Segmentation Faults In Some Applications On AMD Polaris-based GPUs Dec 9, 2024

TeusLollo changed the title ~~Mesa 24.3.1 Update May Be Causing Segmentation Faults In Some Applications On AMD Polaris-based GPUs~~ Mesa 24.3.1 Update Causing Segmentation Faults On Several-Vendors GPU Architectures Dec 10, 2024

SpidFightFR mentioned this issue Dec 10, 2024

Revert: "mesa: update to 24.3.1." + Update to 24.2.8 #53453

Merged

SpidFightFR mentioned this issue Dec 11, 2024

mesa: update to 24.3.1. #53470

Closed

SpidFightFR mentioned this issue Dec 20, 2024

mesa: update to 24.3.2. #53601

Open

classabbyamp removed the needs-testing label Dec 23, 2024

narodnik mentioned this issue Jan 1, 2025

amdgpu system freeze #53787

Open
