Mesa 24.3.1 Update Causing Segmentation Faults On Several-Vendors GPU Architectures #53434
Comments
fwiw, that removal is not the reason things are segfaulting, as the changelog indicates that flag was removed:
|
On 24.3.0, this argument was removed from the build options. Either they re-added it in 24.3.1 (which i'll recheck) or it got replaced by another opt, or it is a bug within mesa itself. |
I of course meant "No reasons that I could find among Void devs". Will be changing spelling soon. |
in fact, dri3 is now always enabled: https://gitlab.freedesktop.org/mesa/mesa/-/commit/8f6fca89aa1812b03da6d9f7fac3966955abc41e |
Could be a bug in Mesa then. I am absolutely sure only Mesa was updated, and I started my Void Box only a few minutes ago after keeping it off for 20+ hours (No segmentation faults before, and no relevant updates to Corectrl in weeks). |
no worries, though i shared the same thoughts as you when i made the original PR for 24.3.0. |
I'll keep an eye out on the different issues tab and stuff. just in case. |
I just typed the issue very fast cause I'm in a hurry, and It may have sounded wrong. Thanks for your understanding. If we find other apps/binaries being segfaulted, it may be prudent to reverse Mesa though, knowing it can be a pain (Well, it's also a pain to downgrade so many packages, I'm getting a list right now) |
that is okay really, the faster we identify problems, the better. |
I'll update it on my production machine. and check what may happen. |
STATUS UPDATE: Just ran the following (Github is ignoring formatting)
And... I'm guessing something happened with Mesa's ABI. If we can't find other apps affected, I'll open an issue on Corectrl's repo. |
I've had multiple programs affected by this issue: Firefox, alacritty, discord (through flatpak) and more. Downgrading fixed all issues. I believe this is related to this mesa issue https://gitlab.freedesktop.org/mesa/mesa/-/issues/12253 |
Seems only to affect AMD Polaris-based GPUs, which is indeed what I am using (Hey, don't judge me, those monster GPUs they make today don't fit in my pc case). |
Indeed, I didn't notice it in my testing; I guess RDNA3 GPUs aren't affected somehow... My apologies for that. |
It's fine. they're adding OpenGL 4.6 API, so they're bound to have lots of regressions before they stabilize, and we can't ask maintainers to do all the tests that mesa-dri devs should be doing in the first place. |
AMD Radeon: after updating to Mesa 24.3.1 the system stopped loading; the screen is black and it no longer responds to anything. |
I have one AMD pc with Polaris 20 RX570 which I will manage to test later today.. |
Same with A8-9600 integrated graphics. Fixed by downgrading.
|
@classabbyamp maybe it's best to revert my changes, wait for things to calm down? |
or maybe the patch from upstream could be tested either way, make a pr please |
AMD Radeon RX 580 |
I'm at work right now, could you please send the link to the patch so that i can check it as soon as possible to make the pr please? |
Further segfaults issues are flooding the issues tab, and don't seem related to this specific bug: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12275 (Segfault on Nvidia Quadro) With this many segfaults on all available GPU vendors, I would begin to doubt this is a problem limited to Polaris architecture (Though it's probably a multi-bug release due to fundamental ABI changes). |
OK, those issues seem confirmed, and we're already seeing developers commenting, and horror screenshots. Definitely, there are ABI changes that broke support on multiple architectures, and they don't seem intentional, nor have Mesa developers mentioned dropping support for any architecture. I changed the title accordingly. I will keep the issue open even after a revert-pull (Which is looking more likely now, unless maintainers want to pull at least 3 fixes from upstream), since we'll probably need some extensive testing on whatever gets merged. I ask maintainers to link this issue to any pull request meant to revert/fix those bugs for easier tracking, if possible. Anyone else is free to link this issue to Mesa gitlab devs in case they need more info/testing. |
|
Does this fix anything for anyone? I'm sure the mesa crew would like to know if so. Ser also posted this patch with a PR near the bottom of that first issue, 12253. |
Hit the same bug with a RX480. Added the patch (downloaded directly from gitlab) to the mesa 24.3.1 pkg and can confirm it fixes the issue. |
Very well, but it would be prudent, as mentioned here #53470 (comment), to just wait for a potential .2 future version, given that multiple architectures across multiple vendors are affected. I won't be closing the issue for now (If that's okay with maintainers, of course), since we've hit a good amount of potential testers for when a .2 future version hits, whenever it happens, to which we'll have to eventually update anyway. @SpidFightFR Thanks for the downgrade (Tested on 2 Void boxes, it just werks). |
@TeusLollo Hey, thanks for your message, it truly means a lot. I'll keep an eye out for the next minor release of mesa's 24.3 branch. |
users affected by this, please test this: #53601 |
Hey, I'm not sure if this is related or not, but there has been a serious bug with amdgpu and mesa. See the issues here: https://gitlab.freedesktop.org/drm/amd/-/issues Multiple reports of people's machines just freezing. In my case, the wayland session is usable for ~20 mins, then everything locks up. I just use the TTY now on that machine. I tried using older kernels: 6.6, 6.10, 6.11. I tried downgrading amdgpu from 20240909_1 to 20241110_1 and 20240909_1. I tried switching to onboard integrated graphics (also AMD) and not using the graphics card.
So I suspect the problem is with the recent mesa upgrade 24.2.8_2. I tried downgrading my mesa but there are so many confusing dependencies that it's very difficult. |
Do this:
Basically, after you have ensured to have Obviously, if you already cleared your cache, you won't have any obsolete packages in your cache, and thus this method will be unavailable. The manual entry is here: https://docs.voidlinux.org/xbps/advanced-usage.html It may not be related, though: AMD has a history of hard freezes on multiple occasions, although you never know if a segmentation fault in a wayland-based WM could actually trigger a display freeze. As for me, I may be able to test this in a few days, because, you know, end of the year with family and stuff, and it's gonna take a while to compile, install, and perform broad tests, assuming it all goes well. |
I try to downgrade mesa, but it says I need to downgrade libglapi. I try to downgrade libglapi, but it says I need to downgrade mesa.
TBH I think void should just downgrade mesa/amdgpu. There's something very clearly massively broken in this latest release. People are reporting their systems just freezing and requiring a hard reboot. |
do all the downgrades in 1 command.
we are not using the latest release. we already downgraded back to 24.2.x from that. which version are people reporting causes freezes? |
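A minimal sketch of the "all downgrades in one command" advice, for the case where mesa and libglapi each refuse to downgrade without the other. It assumes the old packages are still in the XBPS cache at `/var/cache/xbps`, that cached files follow the `<pkgname>-<version>_<revision>.<arch>.xbps` naming, and that `xdowngrade` (from xtools, described in the Void docs) accepts several files in one call. The helper names, the file list, and the `libglapi` revision used in the usage note are hypothetical. The code only builds the command line; it does not run it.

```python
import re
from pathlib import Path

# Assumed cache naming: <pkgname>-<version>_<revision>.<arch>.xbps
CACHE_NAME = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-_]+_\d+)\.(?P<arch>[^.]+)\.xbps$"
)


def pick_cached(filenames, wanted):
    """Return the cached file for each {pkgname: version} pair, in `wanted` order."""
    found = {}
    for fn in filenames:
        m = CACHE_NAME.match(fn)
        if m and wanted.get(m["name"]) == m["version"]:
            found[m["name"]] = fn
    missing = [name for name in wanted if name not in found]
    if missing:
        # A cleared cache means this method is unavailable.
        raise FileNotFoundError(f"not in cache: {', '.join(missing)}")
    return [found[name] for name in wanted]


def downgrade_command(cache_dir, filenames, wanted):
    """Build ONE command listing every package, so interdependent
    packages (mesa <-> libglapi) are downgraded together."""
    files = pick_cached(filenames, wanted)
    return ["xdowngrade"] + [str(Path(cache_dir) / f) for f in files]
```

For example, `downgrade_command("/var/cache/xbps", os.listdir("/var/cache/xbps"), {"mesa": "24.2.8_2", "libglapi": "24.2.8_2"})` would return a single argument list naming both files, which can then be inspected before running it as root.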
You can add multiple arguments in succession. Thus,
We, in fact, have already downgraded. We were on BTW, keep in mind that AMD-based GPUs have a history of hard-freezes that is not related to https://gitlab.freedesktop.org/drm/amd/-/issues/960 Also, ensure that |
Thank you so much, that's very helpful. I had these lock-up issues a year ago, but then they went away. When they were happening, they would happen maybe every 8+ hours. This recurrence is now quicker, like 30 mins or so. Unfortunately I tried downgrading but it didn't go away. The system still freezes. I tried these combos: Linux kernel 6.10:
Kernel 6.6:
Is there anything I'm missing? When I downgrade firmware, do I need to run anything else? Could there be another component causing the issue? Based off your advice, I tried underclocking the GPU but it didn't work either.
https://wiki.gentoo.org/wiki/AMDGPU#Frequent_and_Sporadic_Crashes Still got the crash though. |
Every time the AMDGPU kernel driver is worked upon (Usually to add support for newer GPUs), something like this happens. It comes and goes, and it's been like this for years. Remember that the AMDGPU driver is shared by many GPU adapters (Like, tens of those), thus one bug will affect multi-generation adapters.
I did not say to attempt any undervolting (Which should not be attempted unless you really know what you're doing). Also, the commands you're listing here are not about undervolting, but about indexing available GPU power states (Again, avoid attempting underclocking). No, I was writing about doing this: https://linuxreviews.org/HOWTO_undervolt_the_AMD_RX_4XX_and_RX_5XX_GPUs But not the undervolting part, the "The Quick And Easy Way To Manually "Undervolt" AMD GPUs" (Which, again, is not "undervolting", but fixing GPU clock states, that's why they put it in the ""). Again, much easier to just get Monitor temperatures, though, you don't want the thing to catch on fire. |
Thanks so much. This is hugely helpful. I followed your advice, and rebooted kernel 6.6 with the param Then I open corectrl. My external GPU is card 0, and internal is card 1. For card 0, I see the GPU frequency constantly changing, but it remains fixed at 600 Mhz for card 1. I also tried setting it through the sysfs API:
Even with this, my system crashed. Interestingly it only seems to happen when switching windows. If I use a single terminal window, the crash doesn't happen. |
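The commands used in the comments above are not preserved, so the following is only a sketch of how the "indexing available GPU power states" step could be checked. It assumes the amdgpu sysfs file `pp_dpm_sclk` (for example under `/sys/class/drm/card1/device/`), whose lines look like `1: 600Mhz *` with the asterisk marking the active level. The function name and the sample values are hypothetical.

```python
import re

# Assumed line format of pp_dpm_sclk: "<index>: <freq>Mhz" plus "*" on the active level
LEVEL = re.compile(r"^(\d+):\s*(\d+)\s*mhz\s*(\*)?\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Return ([(index, mhz), ...], active_index or None)."""
    levels, active = [], None
    for line in text.splitlines():
        m = LEVEL.match(line.strip())
        if not m:
            continue  # skip anything that is not a level line
        idx, mhz = int(m.group(1)), int(m.group(2))
        levels.append((idx, mhz))
        if m.group(3):
            active = idx
    return levels, active
```

Reading the file a few times while switching windows and comparing the active index would show whether the clock really stays pinned at one level, as Corectrl reported for card 1.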
A few more things:
If not, then you're experiencing a different problem than the one we've been identifying here. If so, you may want to open a separate issue here on void-packages (Here a package downgrade can be requested, assuming many are experiencing the same problem, or a patch commit if developers appear to have produced a hotfix somewhere). |
Thanks so much. You've really been very generous with your time. I've followed all your advice above. Disabling the internal GPU is a good idea. My WM is wlroots based DWL. I checked the output, and nothing seems to appear in dmesg nor my WM's output. But just looking at the amdgpu issue tracker, there's a whole load of new reports about crashing cards so I think this is not just isolated to me. But you're right, and you've given me some powerful leads to chase up. Thanks again. Indeed seems a big issue: https://gitlab.freedesktop.org/drm/amd/-/issues/3092 What's strange is everything was fine until just a week ago when I updated. Now even downgrading doesn't fix it. |
hey just commenting that I managed to get a stable configuration, and have opened a new issue here: #53787 |
Is this a new report?
Yes
System Info
Void 6.6.63_1 x86_64 GenuineIntel uptodate rFFFF
Package(s) Affected
corectrl-1.4.1_1
Does a report exist for this bug with the project's home (upstream) and/or another distro?
None Found.
Expected behaviour
No segmentation faults given a compatible ABI interface.
Actual behaviour
Since Mesa update 027f896 I'm getting segmentation faults with Corectrl,
and noticed that configure_args = "-Ddri3=enabled"
was removed for no apparent reason that I could find (At least among discussions by Void Devs on Github. What Mesa devs may or may not have done I couldn't know). This could impact further applications if it results in changes to the ABI, and that missing argument may itself be unintentional.
EDIT: See comments and links to upstream, but, basically, unintentional ABI changes causing segfaults on multiple architectures:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12253 (Segfault on AMD Polaris)
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12275 (Segfault on Nvidia Quadro)
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12283 (Segfault with Gnome-Shell on Intel Integrated Graphics)
Steps to reproduce
Update to Mesa 24.3.1 027f896
Run Corectrl in a terminal
Amid other generic Qt5 errors, notice the segmentation fault at the end:
[09-12-24 19:24:34.442][I] No translation found for locale en_US
[09-12-24 19:24:34.442][I] Using en_EN translation.
QSystemTrayIcon::setVisible: No Icon set
qt.qpa.wayland: Wayland does not support QWindow::requestActivate()
zsh: segmentation fault corectrl
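For anyone testing several binaries after a Mesa update, the reproduction steps above can be sketched as a small check that tells a segmentation fault apart from an ordinary failure. This is a sketch for POSIX systems, where Python's `subprocess` reports a child killed by a signal as a negative return code; the function name `run_and_classify` is hypothetical.

```python
import signal
import subprocess
import sys


def run_and_classify(argv):
    """Run argv and return 'segfault', 'ok', or 'failed(<code>)'."""
    proc = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    # On POSIX, a child terminated by signal N has returncode == -N.
    if proc.returncode == -signal.SIGSEGV:
        return "segfault"
    return "ok" if proc.returncode == 0 else f"failed({proc.returncode})"
```

Calling `run_and_classify(["corectrl"])` before and after a downgrade gives a quick yes/no on whether the crash is still present. GUI programs that do not exit on their own would need a timeout, which this sketch leaves out.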