Multi-context workloads and mem_trace failing #137

cesar-avalos3 · 2024-12-16T15:26:01Z

Hello,
We found an issue with the mem_trace tool and multi-context workloads. I'm running NVBit 1.7.2 (the latest release) with CUDA 12.4, the mem_trace tool included with NVBit, and this sample workload (https://github.com/cesar-avalos3/simple_multi_gpu). Running the workload with mem_trace.so produces the following assertion failure:
"ASSERT FAIL: nvbit_imp.cpp:582:void Nvbit::create_ctx(CUcontext): FAIL !(tmp_dir != nullptr) MSG: temporary directory cannot be created, please make sure /tmp is writable!"
With the 1.5.5 release of NVBit and its mem_trace, the same workload runs without problems.
We tried to work around the error by interposing on the offending mkdtemp call (via LD_PRELOAD), which resulted in a deadlock. Turning nvbit_at_ctx_term into a no-op allowed tracing to finish "successfully", with the side effect that the second context is invisible to the tracer.
We saw this behaviour on our own servers (V100s) and on Lambda Labs (A100) machines as well.
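Since the assertion message points at /tmp, one quick way to rule out a plain permissions problem is to call mkdtemp directly, outside of NVBit. The following is only a minimal sketch: the helper name `can_create_tmp_dir` and the `nvbit_check.XXXXXX` template are made up for illustration, and it says nothing about how NVBit itself builds its temporary path.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of NVBit): returns 1 if a temporary
 * directory can be created and then removed under `base`, 0 otherwise. */
int can_create_tmp_dir(const char *base) {
    char path[4096];

    /* mkdtemp requires the template to end in six 'X' characters. */
    int n = snprintf(path, sizeof path, "%s/nvbit_check.XXXXXX", base);
    if (n < 0 || (size_t)n >= sizeof path) {
        fprintf(stderr, "template path too long\n");
        return 0;
    }

    /* On success mkdtemp rewrites `path` in place with the created name. */
    if (mkdtemp(path) == NULL) {
        fprintf(stderr, "mkdtemp under %s failed: %s\n", base, strerror(errno));
        return 0;
    }

    /* Clean up the directory we just created. */
    if (rmdir(path) != 0) {
        fprintf(stderr, "rmdir %s failed: %s\n", path, strerror(errno));
        return 0;
    }
    return 1;
}
```

If `can_create_tmp_dir("/tmp")` returns 1 for the same user and on the same machine where the tracer fails, /tmp itself is probably writable, and the failure is more likely tied to how the second context is set up than to directory permissions.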
(Probably related to #133)
Thanks!
