this post was submitted on 21 Oct 2025

cross-posted from: https://lemmy.ml/post/37817953

Hi all, my GPU crashes when I run software with a high GPU load (in this case an AI model). It also happens with games, after a random amount of time (I can play for about 30 minutes before it crashes, or sometimes it doesn't crash at all).

Here is my journalctl log:

Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State Completed
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu:  Process python pid 4571 thread python pid 5777
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset begin!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: device lost from bus!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] device wedged, but recovered through reset
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] *ERROR* [CRTC:61:crtc-0] flip_done timed out
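The key line in the log above is the ring timeout. As a rough sketch (not an official tool), the snippet below pulls the ring name and the two sequence numbers out of such a line so that repeated crashes can be compared. The function name `parse_ring_timeout` and the `pending` field are made up for illustration, and reading the difference between `emitted` and `signaled` as "jobs submitted but not finished" is my interpretation of the message.

```python
import re

# Pattern for the amdgpu ring-timeout line as it appears in the log above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return ring name and sequence numbers, or None if the line does not match."""
    match = RING_TIMEOUT.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return {
        "ring": match.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Assumed meaning: work emitted to the ring that never signaled completion.
        "pending": emitted - signaled,
    }


sample = (
    "Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: "
    "ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620"
)
print(parse_ring_timeout(sample))
```

Feeding it the output of `journalctl -k` line by line would show whether it is always the same ring that times out.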

I tried to check the path /sys/class/drm/card1/device/devcoredump/data after a reboot, but there isn't anything there (in fact, the devcoredump folder doesn't even exist).
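The missing folder is expected: /sys is an in-memory filesystem, so nothing under it survives a reboot, and as far as I know the kernel also discards the dump on its own after a few minutes. The dump would have to be copied while the system is still up after the reset. Below is a minimal sketch of that, assuming the card1 path from the log (the index may differ on other machines) and that the script runs with enough privileges to read the file. The function name and output filename are hypothetical.

```python
import shutil
from pathlib import Path

# Path taken from the kernel log above; the card index may differ per machine.
DEVCOREDUMP = Path("/sys/class/drm/card1/device/devcoredump/data")


def save_devcoredump(source=DEVCOREDUMP, dest_dir="."):
    """Copy the dump into dest_dir if it exists; return the new path or None."""
    source = Path(source)
    if not source.exists():
        return None
    dest = Path(dest_dir) / "amdgpu-devcoredump.bin"
    # Stream the copy so a large dump does not have to fit in memory.
    with source.open("rb") as src, dest.open("wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest


if __name__ == "__main__":
    saved = save_devcoredump()
    print(saved if saved else "no devcoredump present")
```

This only helps if the machine stays usable after the reset (for example over SSH); if the whole system locks up, the dump is lost with it.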

My specs:

  • Distro: Arch
  • Kernel: 6.17.3.arch2-1
  • Driver: Mesa 1:25.2.4-2
  • GPU: RX 580
  • CPU: R5 5500
  • PSU: EVGA 650 N1 (650 W)

I am on the latest version of my BIOS.

Edit: my

Is there anything I can do to diagnose the issue? Any help is appreciated. Thank you!

Solved: my GPU is dead.

top 15 comments
[–] frongt@lemmy.zip 8 points 5 days ago (1 children)

Distro? Driver version? Temperature? Is it receiving enough power?

If everything checks out, it might just be defective.

[–] Kiuyn@lemmy.ml 6 points 5 days ago* (last edited 5 days ago) (2 children)

Hi, when I run an AI model it crashes once the GPU temperature has been around 82 °C for more than a few seconds. Is that because of the temperature, or is the GPU defective? As for the other info you asked about: I use Arch, on kernel 6.17.3.arch2-1 and Mesa 1:25.2.4-2.
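To check whether the crashes really line up with temperature, the sensor values can be logged while the workload runs. This is only a sketch: it assumes the GPU is card1 (as in the log) and that the driver exposes its sensors through the standard hwmon files under the device directory. The function name `read_gpu_temps` is hypothetical.

```python
from pathlib import Path


def read_gpu_temps(device_dir="/sys/class/drm/card1/device"):
    """Return {sensor file path: degrees Celsius} for every temp*_input found."""
    temps = {}
    for sensor in sorted(Path(device_dir).glob("hwmon/hwmon*/temp*_input")):
        try:
            # hwmon reports temperatures in millidegrees Celsius.
            temps[str(sensor)] = int(sensor.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor unreadable or not a number; skip it
    return temps


if __name__ == "__main__":
    for path, celsius in read_gpu_temps().items():
        print(f"{path}: {celsius:.1f} C")
```

Running this in a loop with a timestamp and comparing the last reading before each crash would show whether 82 °C is a consistent trigger or a coincidence.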

[–] glitching@lemmy.ml 2 points 5 days ago (1 children)

What/how do you run an LLM on an RX 580? I thought ROCm was for RX 6xxx and newer?

[–] Kiuyn@lemmy.ml 2 points 5 days ago* (last edited 5 days ago)

I run it on Vulkan with llama-cpp.

[–] eldavi@lemmy.ml 1 points 5 days ago

One possibly expensive way to find out is to add a better cooling solution and see if it stays stable.

[–] edinbruh@feddit.it 2 points 5 days ago

I experience something similar on a Vega 56, but it doesn't happen under generic high loads. It only happens:

  • in some specific loads (for me it was mainly A Plague Tale: Requiem)
  • when something touches the grounding of my monitor (the display port of the GPU is faulty, it seems)
[–] mike_wooskey@lemmy.thewooskeys.com 3 points 5 days ago (1 children)

When I experienced the same symptoms, I eventually found out it was because ROCm didn't support having an AMD GPU alongside an AMD iGPU (an iGPU is an integrated GPU, built into the CPU rather than a separate card). Once I disabled the iGPU, those symptoms stopped.

I don't remember how I disabled the iGPU. It might have been in the BIOS settings, or it might have been a kernel parameter in /etc/default/grub.

If it doesn't fix your issue, you can just re-enable the iGPU.

[–] Kiuyn@lemmy.ml 1 points 5 days ago

Hi, thanks for the comment. I don't have an iGPU though, so I don't think that is my issue.

[–] PetteriPano@lemmy.world 3 points 5 days ago (1 children)

I have two machines running the latest kernels on EndeavourOS. One with a Radeon RX 7900 XTX has no issues.

The other one has a Radeon 6650 XT. Since a week or two ago, its kworker threads have been getting stuck while throwing errors about fence queues. The load average can climb into the hundreds (there is no real load, just blocked threads) until the machine crashes.

As I recall there was an amdgpu firmware update around the time it started happening, but the changelog on the amdgpu kernel driver hints at solving similar issues.
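The load in the hundreds with no real work fits how Linux computes load average: tasks in uninterruptible sleep (state D) are counted along with runnable ones. A rough way to see which tasks are stuck is to read the state field from /proc, as sketched below. The helper names are made up for illustration; the parsing relies on the documented layout of /proc/<pid>/stat.

```python
from pathlib import Path


def parse_state(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return comm, state


def blocked_tasks(proc_dir="/proc"):
    """List (pid, comm) for tasks in uninterruptible sleep (state D)."""
    blocked = []
    for stat in Path(proc_dir).glob("[0-9]*/stat"):
        try:
            comm, state = parse_state(stat.read_text())
            pid = int(stat.parent.name)
        except (OSError, ValueError, IndexError):
            continue  # task exited while we were reading, or odd format
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If the list fills up with kworker entries while the load climbs, that supports the "blocked threads, not real load" reading.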

[–] Kiuyn@lemmy.ml 0 points 5 days ago

Hi, thank you for the information. I will try rolling back some of the firmware and switching to the LTS kernel to see if the issue persists.

[–] semperverus@lemmy.world 3 points 5 days ago* (last edited 5 days ago) (1 children)

This happens to me when I run games sometimes in 4k at max settings, with a 7900XTX. So far I have not found anything that prevents it, and I'm starting to suspect my power supply or my house's wiring might be the issue. It almost seems like a voltage sag.

[–] Kiuyn@lemmy.ml 2 points 5 days ago

I have also started to suspect my PSU, because the EVGA 650 N1 is notorious for being a bad PSU (I only found that out quite a long time after I bought it). The problem is that my GPU is also second-hand, so I am not sure right now, to be honest.

[–] IceVAN@beehaw.org 2 points 5 days ago (1 children)

Do you have a PSU that is powerful enough, decent, and not too old?

[–] Kiuyn@lemmy.ml 1 points 5 days ago

My PSU is one year old and rated at 650 W (EVGA 650 N1). The problem is that there seems to be a lot of criticism of it (people say it is really bad).