Mesa 24.3.1 Update Causing Segmentation Faults On Several-Vendors GPU Architectures #53434

Open
TeusLollo opened this issue Dec 9, 2024 · 42 comments
Labels
bug Something isn't working

Comments

@TeusLollo

TeusLollo commented Dec 9, 2024

Is this a new report?

Yes

System Info

Void 6.6.63_1 x86_64 GenuineIntel uptodate rFFFF

Package(s) Affected

corectrl-1.4.1_1

Does a report exist for this bug with the project's home (upstream) and/or another distro?

None Found.

Expected behaviour

No segmentation faults, given a compatible ABI.

Actual behaviour

Since Mesa update 027f896, Corectrl segfaults on launch. I also noticed that configure_args = "-Ddri3=enabled" was removed for no reason that I could find (at least among discussions by Void devs on GitHub; what Mesa devs may or may not have done, I couldn't know). If the removal changed the ABI, other applications could be affected, and the missing argument may itself be unintentional.

EDIT: See the comments and the upstream links. In short, unintentional ABI changes are causing segfaults on multiple architectures:

https://gitlab.freedesktop.org/mesa/mesa/-/issues/12253 (Segfault on AMD Polaris)
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12275 (Segfault on Nvidia Quadro)
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12283 (Segfault with Gnome-Shell on Intel Integrated Graphics)

Steps to reproduce

Update to Mesa 24.3.1 027f896

Run Corectrl in a terminal

Amid other generic Qt5 errors, notice the segmentation fault at the end:

[09-12-24 19:24:34.442][I] No translation found for locale en_US
[09-12-24 19:24:34.442][I] Using en_EN translation.
QSystemTrayIcon::setVisible: No Icon set
qt.qpa.wayland: Wayland does not support QWindow::requestActivate()
zsh: segmentation fault corectrl

@TeusLollo added the bug (Something isn't working) and needs-testing (Testing a PR or reproducing an issue needed) labels on Dec 9, 2024
@classabbyamp
Member

@SpidFightFR

@classabbyamp
Member

fwiw, that removal is not the reason things are segfaulting, as the changelog indicates that flag was removed:

  • meson: delete dri3 build option

@SpidFightFR
Contributor

On 24.3.0, this argument was removed from the build options.

Either they re-added it in 24.3.1 (which I'll recheck), or it got replaced by another option, or it is a bug within Mesa itself.

@TeusLollo
Author

TeusLollo commented Dec 9, 2024

fwiw, that removal is not the reason things are segfaulting, as the changelog indicates that flag was removed:

  • meson: delete dri3 build option

I of course meant "No reasons that I could find among Void devs". Will be changing spelling soon.

@classabbyamp
Member

in fact, dri3 is now always enabled: https://gitlab.freedesktop.org/mesa/mesa/-/commit/8f6fca89aa1812b03da6d9f7fac3966955abc41e

@TeusLollo
Author

TeusLollo commented Dec 9, 2024

Could be a bug in Mesa then. I am absolutely sure Mesa was the only thing that updated, and I started my Void box only a few minutes ago after keeping it off for 20+ hours (no segmentation faults before, and no relevant updates to Corectrl in weeks).
Other apps/binaries may be affected in any case, if the ABI changed, intentionally or not.

@SpidFightFR
Contributor

fwiw, that removal is not the reason things are segfaulting, as the changelog indicates that flag was removed:

  • meson: delete dri3 build option

I of course meant "No reasons that I could find among Void devs". Will be changing spelling soon.

no worries, though i shared the same thoughts as you when i made the original PR for 24.3.0.

@SpidFightFR
Contributor

Could be a bug in Mesa then. I am absolutely sure Mesa was the only thing that updated, and I started my Void box only a few minutes ago after keeping it off for 20+ hours (no segmentation faults before, and no relevant updates to Corectrl in weeks). Other apps/binaries may be affected in any case, if the ABI changed, intentionally or not.

I'll keep an eye out on the different issues tab and stuff. just in case.

@TeusLollo
Author

fwiw, that removal is not the reason things are segfaulting, as the changelog indicates that flag was removed:

  • meson: delete dri3 build option

I of course meant "No reasons that I could find among Void devs". Will be changing spelling soon.

no worries, though i shared the same thoughts as you when i made the original PR for 24.3.0.

I just typed the issue very fast because I'm in a hurry, and it may have sounded wrong. Thanks for your understanding.

If we find other apps/binaries segfaulting, it may be prudent to revert Mesa though, knowing it can be a pain (well, it's also a pain to downgrade so many packages; I'm getting a list right now).

@SpidFightFR
Contributor

fwiw, that removal is not the reason things are segfaulting, as the changelog indicates that flag was removed:

  • meson: delete dri3 build option

I of course meant "No reasons that I could find among Void devs". Will be changing spelling soon.

no worries, though i shared the same thoughts as you when i made the original PR for 24.3.0.

I just typed the issue very fast because I'm in a hurry, and it may have sounded wrong. Thanks for your understanding.

If we find other apps/binaries segfaulting, it may be prudent to revert Mesa though, knowing it can be a pain (well, it's also a pain to downgrade so many packages; I'm getting a list right now).

that is okay really, the faster we identify problems, the better.
I do prefer the way you opened the issue, even if it may be a false positive, rather than letting an important bug pass through in production.

@SpidFightFR
Contributor

I'll update it on my production machine. and check what may happen.

@TeusLollo
Author

TeusLollo commented Dec 9, 2024

STATUS UPDATE:

Just ran the following (Github is ignoring formatting)

sudo xdowngrade \
  /var/cache/xbps/mesa-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-dri-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-dri-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-libgallium-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-libgallium-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-vulkan-overlay-layer-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-vulkan-overlay-layer-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-vulkan-radeon-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/mesa-vulkan-radeon-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libglapi-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libglapi-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libgbm-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libgbm-32bit-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libOSMesa-24.2.7_1.x86_64.xbps \
  /var/cache/xbps/libOSMesa-32bit-24.2.7_1.x86_64.xbps

And...Corectrl was correctly launched with no segmentation fault errors. Running in userspace tray right now with no apparent problems, GPU fans were also correctly manipulated by the application.

I'm guessing something happened with Mesa's ABI.

If we can't find other apps affected, I'll open an issue on Corectrl's repo.

@EnumuratedDev

I've had multiple programs affected by this issue: Firefox, alacritty, discord (through flatpak) and more. Downgrading fixed all issues. I believe this is related to this mesa issue https://gitlab.freedesktop.org/mesa/mesa/-/issues/12253

@TeusLollo
Author

I've had multiple programs affected by this issue: Firefox, alacritty, discord (through flatpak) and more. Downgrading fixed all issues. I believe this is related to this mesa issue https://gitlab.freedesktop.org/mesa/mesa/-/issues/12253

Seems only to affect AMD Polaris-based GPUs, which is indeed what I am using (Hey, don't judge me, those monster GPUs they make today don't fit in my pc case).

@TeusLollo changed the title from "Removal of -Ddri3=enabled from Mesa 24.3.1 Update May Be Causing Segmentation Faults In Some Applications" to "Mesa 24.3.1 Update May Be Causing Segmentation Faults In Some Applications" on Dec 9, 2024
@SpidFightFR
Contributor

I've had multiple programs affected by this issue: Firefox, alacritty, discord (through flatpak) and more. Downgrading fixed all issues. I believe this is related to this mesa issue https://gitlab.freedesktop.org/mesa/mesa/-/issues/12253

Seems only to affect AMD Polaris-based GPUs, which is indeed what I am using (Hey, don't judge me, those monster GPUs they make today don't fit in my pc case).

Indeed, I didn't notice it in my testing; I guess RDNA3 GPUs aren't affected somehow... My apologies for that.

@TeusLollo
Author

I've had multiple programs affected by this issue: Firefox, alacritty, discord (through flatpak) and more. Downgrading fixed all issues. I believe this is related to this mesa issue https://gitlab.freedesktop.org/mesa/mesa/-/issues/12253

Seems only to affect AMD Polaris-based GPUs, which is indeed what I am using (Hey, don't judge me, those monster GPUs they make today don't fit in my pc case).

Indeed, I didn't notice it in my testing; I guess RDNA3 GPUs aren't affected somehow... My apologies for that.

It's fine. They're adding the OpenGL 4.6 API, so they're bound to have lots of regressions before they stabilize, and we can't ask maintainers to do all the tests that mesa-dri devs should be doing in the first place.

@TeusLollo changed the title from "Mesa 24.3.1 Update May Be Causing Segmentation Faults In Some Applications" to "Mesa 24.3.1 Update May Be Causing Segmentation Faults In Some Applications On AMD Polaris-based GPUs" on Dec 9, 2024
@sofijacom

sofijacom commented Dec 10, 2024

AMD Radeon

After updating to Mesa 24.3.1, the system stopped loading; the screen is black and it no longer responds to anything.

@biopsin
Contributor

biopsin commented Dec 10, 2024

I have one AMD pc with Polaris 20 RX570 which I will manage to test later today..

@risusinf

AMD Radeon

After updating to Mesa 24.3.1, the system stopped loading; the screen is black and it no longer responds to anything.

Same with A8-9600 integrated graphics. Fixed by downgrading.

sudo xdowngrade /var/cache/xbps/mesa-24.2.7_1.x86_64.xbps /var/cache/xbps/mesa-dri-24.2.7_1.x86_64.xbps /var/cache/xbps/libglapi-24.2.7_1.x86_64.xbps /var/cache/xbps/mesa-libgallium-24.2.7_1.x86_64.xbps /var/cache/xbps/libgbm-24.2.7_1.x86_64.xbps

@SpidFightFR
Contributor

@classabbyamp maybe it's best to revert my changes, wait for things to calm down?

@classabbyamp
Member

or maybe the patch from upstream could be tested

either way, make a pr please

@naneros

naneros commented Dec 10, 2024

AMD Radeon RX 580
Firefox segmentation fault, supertuxkart black screen.
Downgrading to 24.2.7 fixes the problems.

@SpidFightFR
Contributor

or maybe the patch from upstream could be tested

either way, make a pr please

I'm at work right now, could you please send the link to the patch so that i can check it as soon as possible to make the pr please?

@TeusLollo
Author

TeusLollo commented Dec 10, 2024

or maybe the patch from upstream could be tested
either way, make a pr please

I'm at work right now, could you please send the link to the patch so that i can check it as soon as possible to make the pr please?

Further segfault issues are flooding the issues tab, and don't seem related to this specific bug:

https://gitlab.freedesktop.org/mesa/mesa/-/issues/12275 (Segfault on Nvidia Quadro)
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12283 (Segfault with Gnome-Shell on Intel Integrated Graphics)

With this many segfaults across all available GPU vendors, I would begin to doubt this is a problem limited to the Polaris architecture (though it's probably a multi-bug release due to fundamental ABI changes).
I would say it may be safer to just revert for now, and wait for Mesa to let their next update cook a little longer. As I said before, they're working on the OpenGL 4.6 API, and thus they're bound to have lots of regressions before stabilizing. Besides, I don't think anyone specifically requesting those OpenGL 4.6 API changes expects them to be finalized already.

@TeusLollo changed the title from "Mesa 24.3.1 Update May Be Causing Segmentation Faults In Some Applications On AMD Polaris-based GPUs" to "Mesa 24.3.1 Update Causing Segmentation Faults On Several-Vendors GPU Architectures" on Dec 10, 2024
@TeusLollo
Author

TeusLollo commented Dec 10, 2024

OK, those issues seem confirmed, and we're already seeing developers commenting, and horror screenshots. There are definitely ABI changes that broke support on multiple architectures; they don't seem intentional, nor have Mesa developers mentioned dropping support for any architecture. I changed the title accordingly.

I will keep the issue open even after a revert (which is looking more likely now, unless maintainers want to pull at least 3 fixes from upstream), since we'll probably need some extensive testing on whatever Mesa 24.3.2+ package gets made once Mesa devs have cooked up some fixes. I doubt they'll stay long on Mesa 24.3.1 with all those bugs cropping up.

I ask maintainers to link this issue to any pull request meant to revert/fix those bugs for easier tracking, if possible.

Anyone else is free to link this issue to Mesa gitlab devs in case they need more info/testing.

@biopsin
Contributor

biopsin commented Dec 10, 2024

I have one AMD pc with Polaris 20 RX570 which I will manage to test later today..
Unfortunately I can't trigger a segfault in general testing: Firefox & WebGL, a game...
I'm fine with the downgrade until this pans out; seems they have their hands full for the time being.

@zlice
Contributor

zlice commented Dec 11, 2024

drifix.patch.txt

Does this fix anything for anyone? Sure mesa crew would like to know if so

Ser also posted this patch with a PR near the bottom of that first issue 12253

drifix-simon.patch.txt

@hvraven

hvraven commented Dec 11, 2024

Hit the same bug with a RX480. Added the patch (downloaded directly from gitlab) to the mesa 24.3.1 pkg and can confirm it fixes the issue.
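For anyone wanting to repeat that test, a minimal sketch of staging a patch in a local void-packages checkout follows. It assumes the usual layout where patches for a package live under srcpkgs/<pkg>/patches/; the function name `stage_patch` and every path passed to it are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of xbps-src): copy patch file $3 into the
# patches directory of package $2 inside the void-packages checkout at $1.
# Assumption: the checkout uses the srcpkgs/<pkg>/patches/ layout.
stage_patch() {
    repo=$1
    pkg=$2
    patch=$3
    # Refuse to create directories for a package that does not exist.
    [ -d "$repo/srcpkgs/$pkg" ] || { echo "no such package: $pkg" >&2; return 1; }
    mkdir -p "$repo/srcpkgs/$pkg/patches" &&
    cp "$patch" "$repo/srcpkgs/$pkg/patches/"
}
```

After something like `stage_patch ~/void-packages mesa ~/drifix.patch` (both paths are placeholders), the package would be rebuilt from the checkout with `./xbps-src pkg mesa`. Bumping `revision` in the template is probably needed so the rebuilt package supersedes the installed one.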

@TeusLollo
Author

Hit the same bug with a RX480. Added the patch (downloaded directly from gitlab) to the mesa 24.3.1 pkg and can confirm it fixes the issue.

Very well, but it would be prudent, as mentioned here #53470 (comment), to just wait for a potential .2 future version, given that multiple architectures across multiple vendors are affected.

I won't be closing the issue for now (If that's okay with maintainers, of course), since we've hit a good amount of potential testers for when a .2 future version hits, whenever it happens, to which we'll have to eventually update anyway.

@SpidFightFR Thanks for the downgrade (Tested on 2 Void boxes, it just werks).
And I noticed you were apologizing at #53453 (comment). I just wanted to tell you that your work has been excellent so far, and you shouldn't feel the need to apologize on this particular occasion. You did nothing wrong; it's just that Mesa devs released an update that probably should have baked a little longer. You did well, regardless.
As always, I would like to thank you, the rest of the Void Linux team, and everyone else from the userbase for their efforts and continued support.

@SpidFightFR
Contributor

@TeusLollo Hey, thanks for your message, it truly means a lot.
Although, I'm not officially a maintainer for mesa, i do try my best to keep it updated within void repos, as much as possible. Because it means a lot in my personal usage.

I'll keep an eye out for the next minor release of mesa's 24.3 branch.
In the meantime, there should be one or so new releases into 24.2 branch, so i'll make sure to take these into account.

@classabbyamp
Member

users affected by this, please test this: #53601

@classabbyamp removed the needs-testing (Testing a PR or reproducing an issue needed) label on Dec 23, 2024
@narodnik

Hey I'm not sure if related or not, but there has been a serious bug with amdgpu and mesa. See the issues here: https://gitlab.freedesktop.org/drm/amd/-/issues

Multiple reports of people's machines just freezing. In my case, the wayland session is usable for ~20 mins then everything locks up. I just use the TTY now on that machine.

I tried using older kernels: 6.6, 6.10, 6.11. I tried downgrading amdgpu from 20240909_1 to 20241110_1 and 20240909_1. I tried switching to onboard integrated graphics (also AMD) and not using the graphics card.
Still the issue persists. The computer is near unusable. Not sure what else to try.
I'm using Wayland.

VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 32 [Radeon RX 7700 XT / 7800 XT] (rev c8)
VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev c1)

So I suspect the problem is with the recent mesa upgrade 24.2.8_2. I tried downgrading my mesa but there are so many confusing dependencies that it's very difficult.

@TeusLollo
Author

TeusLollo commented Dec 27, 2024

So I suspect the problem is with the recent mesa upgrade 24.2.8_2. I tried downgrading my mesa but there are so many confusing dependencies that it's very difficult.

Do this:

sudo xbps-install -Syu xtools
ls /var/cache/xbps/ | grep "mesa"
sudo xdowngrade /var/cache/xbps/mesa-(whatever1) /var/cache/xbps/mesa-(whatever2) /var/cache/xbps/mesa-(whatever3)...

Basically, once you have xtools installed (it provides xdowngrade), use ls to list all the mesa packages currently in /var/cache/xbps, then pass them all to xdowngrade to downgrade each and every one of them. Remember that xbps does not provide online downgrades unless upstream does it; the obsolete packages must physically be in your cache to be downgraded. There are safety measures in place: if you attempt to downgrade only some of them, xdowngrade will throw an error and tell you what's missing, so it's relatively safe.

Obviously, if you already cleared your cache, you won't have any obsolete packages in your cache, and thus this method will be unavailable.
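The listing step can be sketched as a small helper. This is only an illustration, not part of xtools: the name `cached_pkgs` is made up, and it assumes cached files follow the <name>-<version>_<revision>.<arch>.xbps pattern seen in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of xtools): print every cached package file
# in directory $1 whose name carries version $2, e.g. 24.2.7_1.
# Assumption: files are named <name>-<version>_<revision>.<arch>.xbps.
cached_pkgs() {
    dir=$1
    ver=$2
    for f in "$dir"/*-"$ver".*.xbps; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Running `cached_pkgs /var/cache/xbps 24.2.7_1` prints the candidates; once the list has been reviewed (an unrelated package could share the version string), it can be handed over in one go with `sudo xdowngrade $(cached_pkgs /var/cache/xbps 24.2.7_1)`. An empty result means the cache no longer holds that version.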

The manual entry here: https://docs.voidlinux.org/xbps/advanced-usage.html

It may not be related, though. AMD has a history of hard freezes on multiple occasions, although you never know whether a segmentation fault in a Wayland-based WM could actually trigger a display freeze.

As for me, I may be able to test this in a few days because, you know, end of the year with family and stuff, and it's going to take a while to compile, install, and perform broad tests, assuming it all goes well.

@narodnik

I try to downgrade mesa, but it says I need to downgrade libglapi. I try to downgrade libglapi, but it says I need to downgrade mesa.

~#  xdowngrade /var/cache/xbps/mesa-24.2.7_1.x86_64.xbps 
index: added `mesa-24.2.7_1' (x86_64).
index: 1 packages registered.
MISSING: libglapi-24.2.7_1
Transaction aborted due to unresolved dependencies.
~# xdowngrade /var/cache/xbps/libglapi-24.2.7_1.x86_64.xbps 
index: added `libglapi-24.2.7_1' (x86_64).
index: 1 packages registered.
libglapi-24.2.7_1 in transaction breaks installed pkg `libOSMesa-24.2.8_2'
libglapi-24.2.7_1 in transaction breaks installed pkg `mesa-24.2.8_2'
libglapi-24.2.7_1 in transaction breaks installed pkg `mesa-libgallium-24.2.8_2'
Transaction aborted due to unresolved dependencies.

TBH I think void should just downgrade mesa/amdgpu. There's something very clearly massively broken in this latest release. People are reporting their systems just freezing and requiring a hard reboot.

https://gitlab.freedesktop.org/drm/amd/-/issues

@classabbyamp
Member

do all the downgrades in 1 command.

TBH I think void should just downgrade mesa/amdgpu. There's something very clearly massively broken in this latest release. People are reporting their systems just freezing and requiring a hard reboot.

we are not using the latest release. we already downgraded back to 24.2.x from that. which version are people reporting causes freezes?

@TeusLollo
Author

TeusLollo commented Dec 28, 2024

I try to downgrade mesa, but it says I need to downgrade libglapi. I try to downgrade libglapi, but it says I need to downgrade mesa.

You can pass multiple arguments in succession. Thus, xdowngrade /var/cache/xbps/mesa-24.2.7_1.x86_64.xbps /var/cache/xbps/libglapi-24.2.7_1.x86_64.xbps (all on the same line, at the same time) and so on.
The limit on command-line length is far above anything you will need here.
That should fix it, although you will probably have to downgrade more packages than mesa & libglapi, but it should be easy enough (xbps will tell you).
This downgrade failsafe is in place because mesa packages of differing versions may not play nicely together, although results may vary.
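Since the point is to avoid mixing Mesa versions, a quick sanity check before calling xdowngrade could look like the sketch below. The helper names `pkg_version` and `same_version` are hypothetical, and the parsing assumes file names of the form <name>-<version>_<revision>.<arch>.xbps with no hyphen inside the version.

```shell
#!/bin/sh
# Hypothetical helper: extract "<version>_<revision>" from a cached package
# file name such as mesa-libgallium-24.2.7_1.x86_64.xbps.
pkg_version() {
    base=${1##*/}                 # drop the directory part
    base=${base%.xbps}            # drop the .xbps suffix
    base=${base%.*}               # drop the architecture (x86_64, ...)
    printf '%s\n' "${base##*-}"   # keep what follows the last hyphen
}

# Return 0 only if every file name given carries the same version.
same_version() {
    first=$(pkg_version "$1")
    for f in "$@"; do
        [ "$(pkg_version "$f")" = "$first" ] || return 1
    done
    return 0
}
```

For example, `same_version /var/cache/xbps/mesa-24.2.7_1.x86_64.xbps /var/cache/xbps/libglapi-24.2.7_1.x86_64.xbps && echo consistent` would print `consistent`, while a list mixing 24.2.7_1 and 24.2.8_2 files would not.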

TBH I think void should just downgrade mesa/amdgpu. There's something very clearly massively broken in this latest release. People are reporting their systems just freezing and requiring a hard reboot.

https://gitlab.freedesktop.org/drm/amd/-/issues

We have, in fact, already downgraded. We were on mesa-24.3.* and that's when I and other users noticed the great segfault cascade. If you can confirm that the issue is gone after downgrading, and can manage to get the output of whatever Wayland-based window manager you are using (it depends on the window manager, each one is a bit different; use man to access its built-in manual), you will be able to see for yourself whether it's a segfault or something else entirely.

BTW, keep in mind that AMD-based GPUs have a history of hard freezes that is not related to mesa. It comes and goes with each release and major update of the kernel-level AMD driver (which is not handled by mesa developers), and it may be related to linux-firmware-amd (which is closed-source) not playing nicely with such driver updates. A good approach is to use a tool like corectrl to lock GPU performance to a fixed level (low or high performance modes, or other approaches where you select a given GPU frequency and block it from downscaling or upscaling), which on my systems has avoided those freezes entirely.

https://gitlab.freedesktop.org/drm/amd/-/issues/960
https://bugs.archlinux.org/task/68396
https://bugs.archlinux.org/task/68424
https://bugs.archlinux.org/task/68402
https://gitlab.freedesktop.org/drm/amd/-/issues/716

Also, ensure that linux-firmware-amd is up-to-date

@narodnik

narodnik commented Dec 28, 2024

Thank you so much, that's very helpful.

I had these lock up issues a year ago, but then they went away.

When they were happening, they would happen maybe every 8+ hours. This time they recur much more quickly, every 30 mins or so.

Unfortunately I tried downgrading but it didn't go away. The system still freezes. I tried these combos:

Linux kernel 6.10:

~# xdowngrade /var/cache/xbps/linux-firmware-amd-20241110_1.x86_64.xbps
~# xdowngrade mesa-24.2.7_1.x86_64.xbps libglapi-24.2.7_1.x86_64.xbps libOSMesa-24.2.7_1.x86_64.xbps mesa-libgallium-24.2.7_1.x86_64.xbps libgbm-24.2.7_1.x86_64.xbps libgbm-devel-24.2.7_1.x86_64.xbps

Kernel 6.6:

~# xdowngrade linux-firmware-amd-20240909_1.x86_64.xbps
~# xdowngrade mesa-24.2.6_1.x86_64.xbps libglapi-24.2.6_1.x86_64.xbps libOSMesa-24.2.6_1.x86_64.xbps mesa-libgallium-24.2.6_1.x86_64.xbps libgbm-24.2.6_1.x86_64.xbps libgbm-devel-24.2.6_1.x86_64.xbps MesaLib-devel-24.2.6_1.x86_64.xbps

Is there anything I'm missing? When I downgrade firmware, do I need to run anything else? Could there be another component causing the issue?

Based off your advice, I tried underclocking the GPU but it didn't work either.

~# cd /sys/class/drm/card0/device
/sys/class/drm/card0/device# echo low > power_dpm_force_performance_level 
/sys/class/drm/card0/device# echo balanced > power_dpm_state 
/sys/class/drm/card0/device# cd ../../card1/device
/sys/class/drm/card1/device# echo low > power_dpm_force_performance_level 
/sys/class/drm/card1/device# echo balanced > power_dpm_state 

https://wiki.gentoo.org/wiki/AMDGPU#Frequent_and_Sporadic_Crashes

Still got the crash though.

@TeusLollo
Author

TeusLollo commented Dec 29, 2024

Thank you so much, that's very helpful.

I had these lock up issues a year ago, but then they went away.

When they were happening, they would happen maybe every 8+ hours. This time they recur much more quickly, every 30 mins or so.

Unfortunately I tried downgrading but it didn't go away. The system still freezes. I tried these combos:

Every time the AMDGPU kernel driver is worked upon (Usually to add support for newer GPUs), something like this happens. It comes and goes, and it's been like this for years. Remember that the AMDGPU driver is shared by many GPU adapters (Like, tens of those), thus one bug will affect multi-generation adapters.

~# cd /sys/class/drm/card0/device
/sys/class/drm/card0/device# echo low > power_dpm_force_performance_level
/sys/class/drm/card0/device# echo balanced > power_dpm_state
/sys/class/drm/card0/device# cd ../../card1/device
/sys/class/drm/card1/device# echo low > power_dpm_force_performance_level
/sys/class/drm/card1/device# echo balanced > power_dpm_state


https://wiki.gentoo.org/wiki/AMDGPU#Frequent_and_Sporadic_Crashes

Still got the crash though.

I did not say to attempt any undervolting (which should not be attempted unless you really know what you're doing). Also, the commands you're listing here are not about undervolting, but about indexing available GPU power states (again, avoid attempting underclocking).
Remember also that kernel parameters are needed at boot for (Most of) these to work, otherwise the GPU will just ignore those, but won't output that they're being ignored, and you'll be thinking it's working while it's not.
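Because a missing boot parameter fails silently, it may be worth checking the running kernel's command line first. The sketch below looks for an amdgpu.ppfeaturemask= entry (the parameter used later in this thread); the function name `has_ppfeaturemask` is hypothetical, and the file argument exists only so the check can be tried on a sample file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel command line (read from file $1,
# /proc/cmdline by default) contains an amdgpu.ppfeaturemask= entry.
has_ppfeaturemask() {
    cmdline_file=${1:-/proc/cmdline}
    # One parameter per line, then look for the amdgpu entry.
    tr ' ' '\n' < "$cmdline_file" | grep -q '^amdgpu\.ppfeaturemask='
}
```

A typical call would be `has_ppfeaturemask && echo "parameter present" || echo "parameter missing"`. This only confirms that the parameter was passed at boot, not that the driver honours every feature bit.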

No, I was writing about doing this:

https://linuxreviews.org/HOWTO_undervolt_the_AMD_RX_4XX_and_RX_5XX_GPUs

But not the undervolting part, the "The Quick And Easy Way To Manually "Undervolt" AMD GPUs" (Which, again, is not "undervolting", but fixing GPU clock states, that's why they put it in the "").
Basically, look at the section for "HOWTO Limit The GPU To A Certain Set Of GPU Clock States" (But remember that you'll need to inject kernel parameters for this to work).
Then, limit the GPU clock state to ONE specific clock state (It's all in the guide). It's critical for it to be ONE clock state, for whatever reason, that black screen is triggered by the GPU switching clock states on the fly (But, if we force the GPU to stay on ONE clock state, there won't be any switching available, thus hopefully no black screen triggering).

Again, much easier to just get corectrl and do it from there. It'll also let you see quite clearly if you've injected kernel parameters correctly (Performance Mode: Advanced, Only have ONE GPU State checked, others disabled).

Monitor temperatures, though, you don't want the thing to catch on fire.
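For those who prefer sysfs over corectrl, pinning a single clock state can be sketched roughly as follows. The helper names are hypothetical, and the sketch rests on one assumption worth double-checking against the amdgpu documentation: writes to pp_dpm_sclk are only accepted while power_dpm_force_performance_level is set to manual.

```shell
#!/bin/sh
# Hypothetical helper: print the index of the active clock state, i.e. the
# line marked with "*" in a pp_dpm_sclk-style listing read from file $1.
active_sclk_state() {
    sed -n 's/^\([0-9][0-9]*\):.*\*[[:space:]]*$/\1/p' "$1"
}

# Hypothetical helper: pin the card whose sysfs device directory is $1
# (e.g. /sys/class/drm/card0/device) to clock state index $2.
# Assumption: the driver requires the "manual" performance level first.
pin_sclk_state() {
    echo manual > "$1/power_dpm_force_performance_level" &&
    echo "$2" > "$1/pp_dpm_sclk"
}
```

As root, `pin_sclk_state /sys/class/drm/card0/device 0` followed by `active_sclk_state /sys/class/drm/card0/device/pp_dpm_sclk` would show whether the state stuck. The card index and the state index are examples; both depend on the machine.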

@narodnik

narodnik commented Dec 29, 2024

Thanks so much. This is hugely helpful.

I followed your advice, and rebooted kernel 6.6 with the param amdgpu.ppfeaturemask=0xffffffff.

Then I open corectrl. My external GPU is card 0, and internal is card 1. For card 0, I see the GPU frequency constantly changing, but it remains fixed at 600 Mhz for card 1.

card0
card1

I also tried setting it through the sysfs API:

/sys/class/drm/card0/device# cat pp_dpm_sclk 
0: 500Mhz
1: 17Mhz *
2: 2254Mhz
/sys/class/drm/card0/device# echo 0 > pp_dpm_sclk 
/sys/class/drm/card0/device# cat pp_dpm_sclk 
0: 500Mhz
1: 7Mhz *
2: 2254Mhz

/sys/class/drm/card1/device# cat pp_dpm_sclk 
0: 400Mhz 
1: 600Mhz *
2: 2200Mhz 
/sys/class/drm/card1/device# echo 0 > pp_dpm_sclk 
bash: echo: write error: Invalid argument

Even with this, my system crashed.

Interestingly it only seems to happen when switching windows. If I use a single terminal window, the crash doesn't happen.

@TeusLollo
Author

Thanks so much. This is hugely helpful.

I followed your advice, and rebooted kernel 6.6 with the param amdgpu.ppfeaturemask=0xffffffff.

Then I open corectrl. My external GPU is card 0, and internal is card 1. For card 0, I see the GPU frequency constantly changing, but it remains fixed at 600 Mhz for card 1.
Even with this, my system crashed.

Interestingly it only seems to happen when switching windows. If I use a single terminal window, the crash doesn't happen.

A few more things:

  1. Remember that with corectrl you need to press the "Apply" button on top-right of the window for settings to be injected, and "Save" (Only appears after having pressed "Apply") for those to be remembered. Double check you did that just in case.

  2. With hardware this powerful, you probably don't need two GPUs working in tandem (and they don't really work in tandem anyway; most of the work is offloaded to the PCIe-located GPU regardless). You can disable your CPU-embedded GPU in BIOS/UEFI. It'll save you CPU heat, power consumption, and lots of headaches in the future when configuration gets confused because there are two display adapters. Even on a dual- or triple-monitor setup, there really is no need to keep the CPU-embedded GPU activated. You may want to retry and see if a 1-GPU setup does away with the crashes.

  3. You shouldn't downgrade the *-firmware packages. They're just collections of binaries provided by manufacturers. Normally, newer versions just increase the number of binaries contained, but some ancient binaries may be dropped because of security concerns. Since you're using fairly recent hardware, you're better off staying on the latest firmware packages.

  4. You did not mention what window manager (WM) you are using. If you're on one of the typical desktop environments (DEs) (Like Gnome, KDE, XFCE and the like), each of those comes with its own WM. You should take a look at their documentation and see where the WM locates .log files. If you're using a custom WM, you may want to check the documentation of that WM to see where it stores .log files. Once you've found .log files, run a search into those, and see if you can find anything with "segfault" or "seg" or "segmentation". Remember for the search to be case-insensitive and NOT to look for whole words only.

If not, then you're experiencing a different problem than the one we've been identifying here. If so, you may want to open a separate issue here on void-packages (where a package downgrade can be requested, assuming many are experiencing the same problem, or a patch commit if developers appear to have produced a hotfix somewhere).
You may otherwise open an issue on the repository of the WM developers, or the repository of the mesa developers (on GitLab) if you're absolutely sure this happened when mesa was updated. They'll take it from there, since what's done here is mostly software packaging and distribution, not development of each of those software suites.
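The log search from point 4 can be sketched with grep. The function name `find_segfaults` is hypothetical, and the log path in the usage example is a placeholder, since each WM stores its logs somewhere different.

```shell
#!/bin/sh
# Hypothetical helper: print, with file line numbers, every line in the given
# log files that mentions a segfault. The match is case-insensitive and does
# not require whole words.
find_segfaults() {
    grep -i -n -E 'segfault|segmentation' "$@"
}
```

For example, `find_segfaults ~/.local/share/my-wm/wm.log` (placeholder path) prints the matching lines and returns a non-zero status when nothing matches. The pattern is narrower than a bare "seg", which would also hit unrelated words such as "segment".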

@narodnik

narodnik commented Dec 29, 2024

Thanks so much. You've really been very generous with your time. I've followed all your advice above. Disabling the internal GPU is a good idea. My WM is wlroots based DWL. I checked the output, and nothing seems to appear in dmesg nor my WM's output. But just looking at the amdgpu issue tracker, there's a whole load of new reports about crashing cards so I think this is not just isolated to me. But you're right, and you've given me some powerful leads to chase up. Thanks again.

Indeed seems a big issue: https://gitlab.freedesktop.org/drm/amd/-/issues/3092

What's strange is everything was fine until just a week ago when I updated. Now even downgrading doesn't fix it.

@narodnik

narodnik commented Jan 1, 2025

hey just commenting that I managed to get a stable configuration, and have opened a new issue here: #53787
