Vulkan: Destroy Vulkan instance on exit #10989

0cc4m · 2024-12-26T19:17:03Z

This seems to fix the Nvidia driver segfault on exit (#10528). If someone can reproduce it reliably, please test if this resolves it.

It's a little more hacky than I had hoped, but I couldn't think of a better way to check when the instance can be destroyed.

Basically the backend is now counting backend contexts and backend (device) buffer allocations and when a backend or buffer is destroyed checks whether any are left. If not, it destroys the instance. This seems to happen early enough that it avoids the issue with the Nvidia driver. Usually it should happen on a backend unload or model unload, depending on which happens last.

Let me know if you see any issues, I had to touch a lot of code to remove references to the devices that would prevent them and thus the instance from being destroyed.

jeffbolznv · 2024-12-26T19:28:40Z

Have you tried this on Windows? What I was seeing when I looked into this recently is that the static destructors in ggml-vulkan are run after the Vulkan driver is unloaded, so things crash if you try to call into the driver at all then. I was specifically seeing it crash when vkDestroyDevice is called.

0cc4m · 2024-12-26T19:39:16Z

Have you tried this on Windows? What I was seeing when I looked into this recently is that the static destructors in ggml-vulkan are run after the Vulkan driver is unloaded, so things crash if you try to call into the driver at all then. I was specifically seeing it crash when vkDestroyDevice is called.

No, I don't have a Windows setup. But this does not rely on the static destructors on exit, it should destroy the devices and instance once the controlling program unloaded the backend and freed the device buffers (unloaded the model) At least for the examples I tested this seems to fix the segfault on Linux.

jeffbolznv · 2024-12-26T20:01:46Z

Cool! I fetched your change and I'm not seeing the crash - the backend is indeed freed before the process is terminated.

jeffbolznv · 2024-12-27T21:30:18Z

I've had this running overnight on a Linux system and no crashes so far. But the previous repro rate was maybe once a day, so not definitive. I can keep running it for a couple more days.

LostRuins · 2024-12-28T05:24:10Z

Unfortunately, this did not solve the BSOD issue for me. Loading the models was fine, but upon unload I got a VIDEO_MEMORY_MANAGEMENT_INTERNAL BSOD again. Again, it only happens whenever I offload enough layers that available VRAM is nearly depleted (> 90% vram utilization) otherwise it will not BSOD.

(sorry for poor quality)

and here's BlueScreenView's info for this minidump

jeffbolznv · 2024-12-28T05:47:04Z

IMO the VIDEO_MEMORY_MANAGEMENT_INTERNAL error is likely unrelated to the crash in driver threads.

LostRuins · 2024-12-28T08:41:33Z

Do you think it's something that can be fixed on the llama.cpp side? Or is it a bug in the graphics driver? I'm running it in user mode without admin permissions, so it shouldn't be able to trigger a BSOD under normal circumstances correct?

jeffbolznv · 2024-12-28T19:01:32Z

Do you think it's something that can be fixed on the llama.cpp side? Or is it a bug in the graphics driver? I'm running it in user mode without admin permissions, so it shouldn't be able to trigger a BSOD under normal circumstances correct?

I'd guess it's one of: OS bug, kernel driver bug, or hardware failure (e.g. memory corruption).

jeffbolznv · 2024-12-29T18:11:21Z

My borrowed Linux system has been running for 3 days now with no failures.

0cc4m · 2024-12-29T19:29:33Z

I also haven't seen any segfaults with it and I get them quite often. I'll rebase it, then it's probably good to merge, unless you see some issue with the code.

…n program exit

jeffbolznv · 2024-12-30T00:26:05Z

ggml/src/ggml-vulkan/ggml-vulkan.cpp


    ~vk_buffer_struct() {
-        if (size == 0) {
+        if (size == 0 || device_ref.expired()) {


If we get here with an expired pointer, does that indicate that we failed to free the VkBuffer/VkDeviceMemory? Is there any way to avoid that?

This case happens in the vk_device_struct deconstructor, there is still the sync_staging buffer around. But at that point the weak pointer is invalid, so I added a ggml_vk_destroy_buffer overload that uses the vk::Device directly and added this check. There might be a better way to handle that.

jeffbolznv · 2024-12-30T00:33:06Z

ggml/src/ggml-vulkan/ggml-vulkan.cpp

+
+    vk_instance_buffers--;
+
+    if (vk_instance_contexts == 0 && vk_instance_buffers == 0) {


If these counters really need to be atomic, then this check should use the return value from the decrement rather than separately reading the value, to avoid the situation where two threads decrement and then both read zero. But with two counters, it gets more complex and maybe a mutex would be better?

They don't. In the beginning I had only one counter, so it made more sense. I replaced them with a mutex now.

There's still the open issue about making the Vulkan backend multithread-capable, which needs some refactoring. I'm trying not to make the situation worse when I add new features.

ggerganov · 2025-01-02T11:27:17Z

Hm, ggml-ci still seems to segfault using this PR. Here is one of the tests that fails on exit:

ggml@ggml-6-x86-vulkan-t4:~/work/llama.cpp/build-ci-debug$ ./bin/test-arg-parser 
test-arg-parser: make sure there is no duplicated arguments in any examples

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Tesla T4 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: KHR_coopmat
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Tesla T4)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 7V12 64-Core Processor)
test-arg-parser: test invalid usage

error while handling argument "-m": expected value for argument

usage:
-m,    --model FNAME                    model path (default: `models/$filename` with filename from `--hf-file`
                                        or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
                                        (env: LLAMA_ARG_MODEL)


to show complete usage, run with -h
error while handling argument "-ngl": stoi

usage:
-ngl,  --gpu-layers, --n-gpu-layers N   number of layers to store in VRAM
                                        (env: LLAMA_ARG_N_GPU_LAYERS)


to show complete usage, run with -h
error while handling argument "-sm": invalid value

usage:
-sm,   --split-mode {none,layer,row}    how to split the model across multiple GPUs, one of:
                                        - none: use one GPU only
                                        - layer (default): split layers and KV across GPUs
                                        - row: split rows across GPUs
                                        (env: LLAMA_ARG_SPLIT_MODE)


to show complete usage, run with -h
error: invalid argument: --draft
test-arg-parser: test valid usage

test-arg-parser: test environment variables (valid + invalid usages)

error while handling environment variable "LLAMA_ARG_THREADS": stoi


test-arg-parser: test environment variables being overwritten

warn: LLAMA_ARG_MODEL environment variable is set, but will be overwritten by command line argument -m
test-arg-parser: all tests OK

Segmentation fault (core dumped)
...

0cc4m · 2025-01-02T13:58:16Z

I haven't merged this yet because I can reproduce the crash reported in #10528 (comment) , but couldn't figure out why it occurs, yet.

github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Dec 26, 2024

0cc4m mentioned this pull request Dec 29, 2024

Misc. bug: Inconsistent Vulkan segfault #10528

Open

Vulkan: Refactor to make sure Vulkan instance is destroyed properly o…

d9b0958

…n program exit

0cc4m force-pushed the 0cc4m/vulkan-instance-cleanup branch from 01ab364 to d9b0958 Compare December 29, 2024 20:41

jeffbolznv approved these changes Dec 30, 2024

View reviewed changes

Use mutex instead of atomics for vk_instance counters

aa014d7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vulkan: Destroy Vulkan instance on exit #10989

Vulkan: Destroy Vulkan instance on exit #10989

0cc4m commented Dec 26, 2024

jeffbolznv commented Dec 26, 2024

0cc4m commented Dec 26, 2024

jeffbolznv commented Dec 26, 2024

jeffbolznv commented Dec 27, 2024

LostRuins commented Dec 28, 2024 •

edited

Loading

jeffbolznv commented Dec 28, 2024

LostRuins commented Dec 28, 2024

jeffbolznv commented Dec 28, 2024

jeffbolznv commented Dec 29, 2024

0cc4m commented Dec 29, 2024

jeffbolznv Dec 30, 2024

0cc4m Dec 30, 2024

jeffbolznv Dec 30, 2024

0cc4m Dec 30, 2024

ggerganov commented Jan 2, 2025

0cc4m commented Jan 2, 2025


		vk_instance_buffers--;

		if (vk_instance_contexts == 0 && vk_instance_buffers == 0) {

Vulkan: Destroy Vulkan instance on exit #10989

Are you sure you want to change the base?

Vulkan: Destroy Vulkan instance on exit #10989

Conversation

0cc4m commented Dec 26, 2024

jeffbolznv commented Dec 26, 2024

0cc4m commented Dec 26, 2024

jeffbolznv commented Dec 26, 2024

jeffbolznv commented Dec 27, 2024

LostRuins commented Dec 28, 2024 • edited Loading

jeffbolznv commented Dec 28, 2024

LostRuins commented Dec 28, 2024

jeffbolznv commented Dec 28, 2024

jeffbolznv commented Dec 29, 2024

0cc4m commented Dec 29, 2024

jeffbolznv Dec 30, 2024

Choose a reason for hiding this comment

0cc4m Dec 30, 2024

Choose a reason for hiding this comment

jeffbolznv Dec 30, 2024

Choose a reason for hiding this comment

0cc4m Dec 30, 2024

Choose a reason for hiding this comment

ggerganov commented Jan 2, 2025

0cc4m commented Jan 2, 2025

LostRuins commented Dec 28, 2024 •

edited

Loading