Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add device info query to report support for native asserts. #2269

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

aarongreig
Copy link
Contributor

This allows cuda and hip to stop reporting the opencl extension string, see issue #1374

@github-actions github-actions bot added conformance Conformance test suite issues. specification Changes or additions to the specification level-zero L0 adapter specific issues cuda CUDA adapter specific issues hip HIP adapter specific issues native-cpu Native CPU adapter specific issues labels Oct 30, 2024
@aarongreig aarongreig force-pushed the aaron/addAssertDeviceInfoQuery branch from 14acfa8 to b38264f Compare November 1, 2024 10:39
@aarongreig aarongreig changed the title Add device info query to report support for devicelib assert. Add device info query to report support for native asserts. Nov 1, 2024
@aarongreig aarongreig force-pushed the aaron/addAssertDeviceInfoQuery branch from b38264f to d5f06df Compare November 1, 2024 10:57
@aarongreig aarongreig marked this pull request as ready for review November 1, 2024 10:58
@aarongreig aarongreig requested review from a team as code owners November 1, 2024 10:58
@aarongreig aarongreig requested a review from JackAKirk November 1, 2024 10:59
@aarongreig
Copy link
Contributor Author

LLVM PR intel/llvm#15929

source/adapters/cuda/device.cpp Outdated Show resolved Hide resolved
@@ -933,6 +928,14 @@ UR_APIEXPORT ur_result_t UR_APICALL urDeviceGetInfo(ur_device_handle_t hDevice,
}
case UR_DEVICE_INFO_COMMAND_BUFFER_EVENT_SUPPORT_EXP:
return ReturnValue(false);
case UR_DEVICE_INFO_USE_NATIVE_ASSERT:
// TODO: Remove comment when HIP supports native asserts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HIP supports native asserts of devices that are most relevant for the users of dpc++ AFAIK: the assert tests definitely pass on CDNA2 cards, and AMD devs implied they should be supported on all supported devices. There was a failure on a RDNA2 card but that has little relevance to dpc++ users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they still have UNSUPPORTED: hip in the e2e test along with a tracker issue, I could change this to a note about it not necessarily being 100% supported and link the issue, or I'm happy to just delete the comment if you don't think it's useful to track that here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they still have UNSUPPORTED: hip in the e2e test along with a tracker issue, I could change this to a note about it not necessarily being 100% supported and link the issue, or I'm happy to just delete the comment if you don't think it's useful to track that here

Sure, whatever you think is best. The only issue with the current comment is that it is all out of date/ factually incorrect.

As far as hip tests marked unsupported in intel/llvm, for a lot of cases this is because the intel/llvm CI does not use an officially supported rocm card. This case is a bit different because the failure still does occur on some officially supported cards in addition to the unsupported ones but I think your wish to add a comment to state something along the lines of "this is mostly supported" is fine.

It is best to try not to take too much account of such failures, unfortunately we have to rely on other testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to omit the comment, looking at intel/llvm#7634 it does seem like the remaining fail there is a weird issue with how the hardware is set up, and based on Georgi's comment it sounds like confidence is pretty high that this should be upgraded to an unqualified "supported"

@GeorgeWeb
Copy link
Contributor

GeorgeWeb commented Nov 1, 2024

@aarongreig According to my latest testing a CDNA2 AMD GPU had reliable native asserts support with latest versions of ROCm. However, I am not sure how well things are with consumer grade ones, but I suspect all supported cards should in theory have support for the feature these days.

Just a question/thought on how this ties to DPC++, but not a requirement:
I am wondering whether we need an intel/llvm PR for this too to update the device config file here https://github.com/intel/llvm/blob/d13019638e4ed3078a52cf06a91080875d2e0d0e/llvm/include/llvm/SYCLLowerIR/DeviceConfigFile.td#L293 wrt native_assert aspect support for HIP. This Device Configuration File is used for SYCL/DPC++ all_devices_have<sycl::aspect> and any_device_has<sycl::aspect> compile-time aspect traits implementation (see https://intel.github.io/llvm/design/DeviceAspectTraitDesign.html). Maybe we can selectively add AspectExt_oneapi_native_assert to all of the Officially supported: HIP-AMD targets regardless of whether they are CDNA arches or whether we tested them all (because it's impossible for us to fully test that on all gpus so we ought to rely on ROCm actually supporting their supported hardware). Until now I've only added it to gfx90a because that's the only one I knew of passing the test reliably 🤷‍♂️.

@aarongreig aarongreig force-pushed the aaron/addAssertDeviceInfoQuery branch from d5f06df to df32785 Compare November 1, 2024 15:03
Copy link
Contributor

@JackAKirk JackAKirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for cuda/hip

@aarongreig
Copy link
Contributor Author

@aarongreig According to my latest testing a CDNA2 AMD GPU had reliable native asserts support with latest versions of ROCm. However, I am not sure how well things are with consumer grade ones, but I suspect all supported cards should in theory have support for the feature these days.

Just a question/thought for this ties to DPC++, but not a requirement: I am wondering whether we need an intel/llvm PR for this too to update the device config file here https://github.com/intel/llvm/blob/d13019638e4ed3078a52cf06a91080875d2e0d0e/llvm/include/llvm/SYCLLowerIR/DeviceConfigFile.td#L293 wrt native_assert aspect support for HIP. This Device Configuration File is used for SYCL/DPC++ all_devices_have<sycl::aspect> and any_device_has<sycl::aspect> compile-time aspect traits implementation (see https://intel.github.io/llvm/design/DeviceAspectTraitDesign.html). Maybe we can selectively add AspectExt_oneapi_native_assert to all of the Officially supported: HIP-AMD targets regardless of whether they are CDNA arches or whether we tested them all (because it's impossible for us to fully test that on all gpus so we ought to rely on ROCm actually supporting their supported hardware). Until now I've only added it to gfx90a because that's the only one I knew of passing the test reliably 🤷‍♂️.

That probably does need to be updated, although I don't really want to include that in intel/llvm#15929 since it's supposed to just be updating how the internal reporting mechanism works

@GeorgeWeb
Copy link
Contributor

That probably does need to be updated, although I don't really want to include that in intel/llvm#15929 since it's supposed to just be updating how the internal reporting mechanism works

Yeah, I agree. Guess the comment was more of a note to refer to in a future follow-up to that specific part of intel/llvm.

Regardless, all cuda/hip related changes look good! Cheers.

Copy link
Contributor

@nrspruit nrspruit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for L0

This allows cuda and hip to stop reporting the relevant opencl extension
string, see issue oneapi-src#1374
@aarongreig aarongreig force-pushed the aaron/addAssertDeviceInfoQuery branch from df32785 to 6a711b3 Compare November 8, 2024 11:44
@aarongreig
Copy link
Contributor Author

ping @oneapi-src/unified-runtime-native-cpu-write @oneapi-src/unified-runtime-opencl-write

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
conformance Conformance test suite issues. cuda CUDA adapter specific issues hip HIP adapter specific issues level-zero L0 adapter specific issues native-cpu Native CPU adapter specific issues specification Changes or additions to the specification
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants