Problem Description
On my platform, I'm using PyTorch 2.3.1 for ROCm 5.7.1 (pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7). This pulls in triton-rocm-2.3.1. So far, all is fine.
Take any hello world triton program and run it.
I get the following error:
Traceback (most recent call last):
File "python_environment/lib/python3.11/site-packages/triton/common/backend.py", line 96, in get_backend
importlib.import_module(device_backend_package_name, package=__spec__.name)
File "/opt/cray/pe/python/3.11.5/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 940, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "python_environment/lib/python3.11/site-packages/triton/third_party/hip/__init__.py", line 5, in <module>
register_backend("hip", HIPBackend)
File "python_environment/lib/python3.11/site-packages/triton/common/backend.py", line 88, in register_backend
_backends[device_type] = backend_cls.create_backend(device_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "python_environment/lib/python3.11/site-packages/triton/common/backend.py", line 80, in create_backend
return cls(device_type)
^^^^^^^^^^^^^^^^
File "python_environment/lib/python3.11/site-packages/triton/third_party/hip/hip_backend.py", line 389, in __init__
self.driver = HIPDriver()
^^^^^^^^^^^
File "python_environment/lib/python3.11/site-packages/triton/runtime/driver.py", line 119, in __init__
self.utils = HIPUtils()
^^^^^^^^^^
File "python_environment/lib/python3.11/site-packages/triton/runtime/driver.py", line 105, in __init__
mod = importlib.util.module_from_spec(spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 573, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1233, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: libamdhip64.so.6: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "test.py", line 90, in <module>
output_triton = add(x, y)
^^^^^^^^^
File "test.py", line 76, in add
add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
File "<string>", line 25, in add_kernel
ValueError: Cannot find backend for hip
Looking into the code, we get to rocm_path_dir (triton/python/triton/common/build.py, line 38 at commit cf44637):
@functools.lru_cache()
def rocm_path_dir():
    default_path = os.path.join(os.path.dirname(__file__), "..", "third_party", "hip")
    # Check if include files have been populated locally. If so, then we are
    # most likely in a whl installation and he rest of our libraries should be here
    if (os.path.exists(default_path + "/include/hip/hip_runtime.h")):
        return default_path
    else:
        return os.getenv("ROCM_PATH", default="/opt/rocm")
This function is used to find the location of the runtime and headers that Triton should use to build the kernel and link against. In my situation, it detects default_path as valid and uses it (it resolves to .../python_environment/lib/python3.11/site-packages/triton/common/../third_party/hip/lib).
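We get to the first issue:
Triton links the kernel against a ROCm 6 runtime and not a ROCm 5 one, even though I pulled Triton from a ROCm 5.7 torch:
$ readelf -a -W .../python_environment/lib/python3.11/site-packages/triton/common/../third_party/hip/lib/libamdhip64.so | grep soname
 0x000000000000000e (SONAME) Library soname: [libamdhip64.so.6]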
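The SONAME check can also be scripted. The sketch below only parses readelf output of the kind quoted in this issue; the helper name soname_major is hypothetical, and it assumes the SONAME follows the usual lib<name>.so.<major> pattern.

```python
import re


def soname_major(readelf_output):
    """Return (soname, major_version) parsed from readelf output, or None.

    The major version is None when the SONAME has no numeric suffix.
    """
    match = re.search(r"\(SONAME\)\s+Library soname: \[([^\]]+)\]", readelf_output)
    if match is None:
        return None
    soname = match.group(1)
    version = re.search(r"\.so\.(\d+)", soname)
    major = int(version.group(1)) if version else None
    return soname, major


# Sample line taken from the readelf output quoted in this issue.
sample = " 0x000000000000000e (SONAME) Library soname: [libamdhip64.so.6]"
print(soname_major(sample))  # ('libamdhip64.so.6', 6)
```

A major version of 6 on a system that only provides ROCm 5.7 is the mismatch described here.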
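The second issue is that, right after building triton/runtime/backends/hip.c, we try to load it into Python (triton/python/triton/runtime/driver.py, line 109 at commit cf44637).
This fails: the libamdhip64.so we linked against is not in LD_LIBRARY_PATH, and it is not even guaranteed to be installed on the system and available via ldconfig's cache. Thus we get the error shown at the top of this issue: ImportError: libamdhip64.so.6: cannot open shared object file: No such file or directory.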
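To see which HIP runtime the dynamic loader can actually resolve in a given environment, a quick probe with ctypes is enough. This is a diagnostic sketch only: the helper can_load is hypothetical, and libamdhip64.so.5 is assumed here to be the SONAME shipped by ROCm 5.x.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can find and open library_name."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library is missing from LD_LIBRARY_PATH, the
        # ldconfig cache, and the default system directories.
        return False
    return True


# libamdhip64.so.5 is an assumed ROCm 5.x SONAME; .so.6 is the one from the error.
for name in ("libamdhip64.so.5", "libamdhip64.so.6"):
    print(name, "loadable" if can_load(name) else "NOT loadable")
```

If only the .so.5 name loads, the module built against Triton's bundled runtime cannot be imported, which matches the ImportError above.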
Third issue: I think that rocm_path_dir works against the user because it prevents them from overriding the library choice described above. I would advocate reversing the priority so that ROCM_PATH, if set, is used before falling back to Triton's own HIP runtime. Or else, provide a way to disable Triton's use of its bundled HIP runtime.
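The reversed priority could look roughly like the sketch below. This is not Triton's code: pick_rocm_dir and its parameters are hypothetical names, and only the header check and the /opt/rocm default are taken from the function quoted earlier.

```python
import os

# Relative header path that rocm_path_dir checks in the bundled tree.
HIP_HEADER = os.path.join("include", "hip", "hip_runtime.h")


def pick_rocm_dir(bundled_dir, environ=None):
    """Choose a ROCm root, preferring ROCM_PATH over the bundled runtime."""
    environ = os.environ if environ is None else environ
    rocm_path = environ.get("ROCM_PATH")
    if rocm_path:
        # An explicit user setting wins.
        return rocm_path
    if os.path.exists(os.path.join(bundled_dir, HIP_HEADER)):
        # Fall back to the runtime shipped inside the wheel.
        return bundled_dir
    return "/opt/rocm"


print(pick_rocm_dir("/nonexistent/third_party/hip", {"ROCM_PATH": "/opt/rocm-5.7.1"}))
```

With this ordering, a user on a ROCm 5.7 system who sets ROCM_PATH would get the system runtime, while wheel-only installations without ROCM_PATH would keep the current behaviour.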
TL;DR: A Triton pulled in with a torch built for ROCm 5 in fact uses ROCm 6 binaries, while the user expects ROCm 5. If the user has not installed ROCm 6, we end up with errors of the form 'ImportError: libamdhip64.so.6: cannot open shared object file: No such file or directory'.
A workaround is to edit rocm_path_dir so that Triton's bundled libraries are not used.
Note that we can't just modify the environment variables so that they point toward Triton's HIP runtime. This could break a lot of other HIP machinery (mixing ROCm 5 and 6?).
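One way such an edit could look, as a sketch rather than the exact patch used: the body of rocm_path_dir is reduced to the ROCM_PATH lookup that already exists in its else branch, so the default_path check is skipped entirely.

```python
import functools
import os


@functools.lru_cache()
def rocm_path_dir():
    # Assumed edit: ignore triton/third_party/hip and always use the
    # system ROCm installation (ROCM_PATH, or /opt/rocm if unset).
    return os.getenv("ROCM_PATH", default="/opt/rocm")


print(rocm_path_dir())
```

Because of lru_cache, ROCM_PATH must be set before the first call; later changes to the environment are not picked up within the same process.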
Operating System
RHEL 8.8
CPU
AMD EPYC 7A53 64-Core Processor
GPU
AMD Instinct MI250X
ROCm Version
ROCm 5.7.1
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
Hi @etiennemlb, thanks for pointing this out. I was able to reproduce it and noticed some discrepancies in the packages being installed. When invoking pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7, the torch, torchvision, and torchaudio wheels being pulled in are specifically built for ROCm 5.7, but the pytorch-triton-rocm wheel is not associated with ROCm 5.7 and, as you have observed, appears to be built for ROCm 6.x. This wheel is also pointed to by the ROCm 6.x directories; for example, the paths to the wheels in https://download.pytorch.org/whl/rocm5.7/pytorch-triton-rocm and https://download.pytorch.org/whl/rocm6.2/pytorch-triton-rocm are identical.
From my point of view, this doesn't have anything to do with Triton on our end; either pytorch.org needs to supply a wheel built for ROCm 5.x, or you need to build Triton from source. I'll reach out internally to figure out who is responsible for providing these wheels. I do see that we claim compatibility with this install method for ROCm 5.7 (https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/native_linux/install-triton.html), so at the very least this should be changed if compatible wheels are not provided.
Hey, while it is now a bit late to fix that issue, I believe a lot of AMD GPU users would benefit from this issue not recurring, especially on HPC systems, which move slowly (slower than the cloud?) and where the available ROCm versions may not be the latest.
Sorry we didn't respond in a timely manner; I hope the workaround you used was sufficient. I agree with you and will be pushing for ROCm 5.7 Triton wheels to be provided if feasible; it doesn't make sense for ROCm 5.7 wheels to install a dependency that specifically does not work on that version. Thanks again for reporting this.