
RDNA2+ cards seem to be underutilized, because of dual compute unit design? #186

Open
pigmej opened this issue Feb 13, 2024 · 6 comments


pigmej commented Feb 13, 2024

From my understanding of the RDNA2 architecture, each RDNA2 (and newer) GPU will report only half of its CUs to OpenCL.

(from the whitepaper)

> The new dual compute unit design is the essence of the RDNA architecture and replaces the GCN compute unit as the fundamental computational building block of the GPU. As Figure 2 illustrates, the dual compute unit still comprises four SIMDs that operate independently. However, this dual compute unit was specifically designed for wave32 mode; the RDNA SIMDs include 32 ALUs, twice as wide as the vector ALUs in the prior generation, boosting performance by executing a wavefront up to 2X faster. The new SIMDs were built for mixed-precision operation and efficiently compute with a variety of datatypes to enable scientific computing and machine learning. Figure 3 below illustrates how the new dual compute unit exploits instruction-level parallelism within a simple example shader to execute on a SIMD with half the latency in wave32 mode and a 44% reduction in latency for wave64 mode compared to the previous generation GCN SIMD.

The result is that it seems that the GPUs are underutilized.
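To make the mismatch concrete, here is a minimal sketch of the relationship (my reading of the whitepaper, not code from this repo): OpenCL on RDNA reports work-group processors (dual compute units), each of which contains two of the CUs that appear in the marketing specs.

```rust
// Hypothetical illustration: RDNA exposes WGPs (dual compute units) to
// OpenCL as "compute units"; each WGP contains two marketing-spec CUs.
fn spec_sheet_cus(reported_max_compute_units: u32) -> u32 {
    reported_max_compute_units * 2
}

fn main() {
    // e.g. 7900 XTX: 48 units reported by OpenCL, 96 CUs on the spec sheet.
    assert_eq!(spec_sheet_cus(48), 96);
    println!("48 reported units -> {} spec-sheet CUs", spec_sheet_cus(48));
}
```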


pigmej commented Feb 13, 2024

7900xtx reports:

2024-02-13T13:37:42.980+0100    INFO    selecting 0 provider from 2 available   {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 321}
2024-02-13T13:37:42.980+0100    INFO    Using provider: [GPU] AMD Accelerated Parallel Processing/gfx1100       {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 334}
2024-02-13T13:37:42.980+0100    INFO    device memory: 24560 MB, max_mem_alloc_size: 20876 MB, max_compute_units: 48, max_wg_size: 256       {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 152}
2024-02-13T13:37:43.763+0100    INFO    preferred_wg_size_multiple: 32, kernel_wg_size: 256     {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 186}
2024-02-13T13:37:43.763+0100    INFO    Using: global_work_size: 41728, local_work_size: 32     {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 199}
2024-02-13T13:37:43.763+0100    INFO    Allocating buffer for input: 32 bytes   {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 203}
2024-02-13T13:37:43.763+0100    INFO    Allocating buffer for output: 1335296 bytes     {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 211}
2024-02-13T13:37:43.763+0100    INFO    Allocating buffer for lookup: 21877489664 bytes {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 219}

So it reports 48 CUs while it actually has 96.

Please note that the preferred_wg_size_multiple: 32 also suggests that the actual number is 96.
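For what it's worth, the logged numbers are internally consistent under some guessed parameters (scrypt N = 8192, LOOKUP_GAP = 2, 128-byte scratchpad cells; these are assumptions inferred from the log, not confirmed from the source):

```rust
// Hedged reconstruction of how the logged work sizes could be derived
// from max_mem_alloc_size. All scrypt parameters below are assumptions.
fn main() {
    let max_mem_alloc: u64 = 20876 * 1024 * 1024; // "max_mem_alloc_size: 20876 MB"
    let cell_bytes: u64 = 128;                    // assumed scratchpad cell size
    let n: u64 = 8192;                            // assumed scrypt N
    let lookup_gap: u64 = 2;                      // assumed default LOOKUP_GAP
    let per_item = (n / lookup_gap) * cell_bytes; // 524288 B of lookup per work item
    let local_work_size: u64 = 32;                // "local_work_size: 32"

    let raw = max_mem_alloc / per_item;                  // 41752 items fit
    let global_work_size = raw - raw % local_work_size;  // round down to LWS multiple

    assert_eq!(global_work_size, 41728);                      // matches the log
    assert_eq!(global_work_size * per_item, 21_877_489_664);  // lookup buffer size
}
```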


pigmej commented Feb 13, 2024

Please note that the 48 CU figure is only used here in a log message. The important parts are max_mem_alloc_size and global_work_size/local_work_size.


pigmej commented Feb 16, 2024

On the 7900xtx you can get more performance by setting LOOKUP_GAP to 4, which yields about a 25% performance improvement.

I think for RDNA2 cards, where we have a high RAM-to-CU ratio, we're allocating too much RAM. For comparison, an NVIDIA 4090 lets us use 6 GiB of RAM (max_mem_alloc_size), while the 7900xtx allows 21 GiB.
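A back-of-the-envelope sketch of the LOOKUP_GAP trade-off (illustrative only; the helper names and the expected-recompute estimate are mine, not from scrypt-ocl): storing every g-th scratchpad cell divides lookup memory by g, but each random read then has to recompute up to g-1 cells on the fly.

```rust
// LOOKUP_GAP trade-off: memory vs. recomputation (assumed N and cell size).
fn lookup_mem_bytes(n: u64, cell_bytes: u64, gap: u64) -> u64 {
    (n / gap) * cell_bytes // only every gap-th cell is stored
}

fn avg_extra_recompute(gap: u64) -> f64 {
    (gap - 1) as f64 / 2.0 // expected extra sequential steps per lookup
}

fn main() {
    let n = 8192; // assumed scrypt N
    assert_eq!(lookup_mem_bytes(n, 128, 2), 524_288);
    assert_eq!(lookup_mem_bytes(n, 128, 4), 262_144); // gap 4 halves memory vs gap 2
    // ...at the cost of more on-the-fly recomputation per lookup:
    assert!(avg_extra_recompute(4) > avg_extra_recompute(2));
}
```

On a card with many spare ALUs relative to memory bandwidth, trading memory for recomputation can come out ahead, which would be consistent with the observed 25% gain.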

@jellonek (Contributor)

That's because right now we are using one scrypter (as prepared in `pub fn new(`), using only the first provider/device, while it looks like OpenCL for this card will most probably produce 2 provider/device pairs, each having 48 compute units.

If that's true (it has to be tested on real HW), maybe we can create a round-robin list of scrypter instances and then use them in parallel (async) later in `self.scrypter`, splitting the labels into len(devices) parts?
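A minimal sketch of the splitting step suggested above (hypothetical helper; the real scrypter types live in scrypt-ocl): divide a label range into len(devices) contiguous parts so each device instance gets one part.

```rust
// Split [start, end) into `parts` contiguous subranges, distributing the
// remainder so every label is covered exactly once.
fn split_range(start: u64, end: u64, parts: u64) -> Vec<(u64, u64)> {
    let total = end - start;
    let chunk = total / parts;
    let rem = total % parts;
    let mut out = Vec::new();
    let mut cur = start;
    for i in 0..parts {
        let len = chunk + if i < rem { 1 } else { 0 };
        out.push((cur, cur + len));
        cur += len;
    }
    out
}

fn main() {
    let parts = split_range(0, 1_000_001, 2);
    assert_eq!(parts, vec![(0, 500_001), (500_001, 1_000_001)]);
    // No label lost or duplicated:
    assert_eq!(parts.iter().map(|(a, b)| b - a).sum::<u64>(), 1_000_001);
}
```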


poszu commented Oct 31, 2024

This could be solved on a higher level by allowing parallel initialization on multiple GPUs (for example if the user has 2 cards attached to the PC). The code could distribute initialization tasks to all available devices in parallel.
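A sketch of that higher-level approach, using plain threads to stand in for the per-device initialization tasks (the device work here is simulated; real code would drive one scrypter per OpenCL device):

```rust
use std::thread;

fn main() {
    // One label subrange per attached GPU (assumed two devices).
    let device_ranges = vec![(0u64, 500u64), (500, 1000)];

    // Dispatch each device's initialization task in parallel.
    let handles: Vec<_> = device_ranges
        .into_iter()
        .map(|(start, end)| {
            thread::spawn(move || {
                // Placeholder for "initialize labels [start, end) on this device".
                end - start
            })
        })
        .collect();

    // Join and confirm all labels were covered.
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(total, 1000);
}
```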


pigmej commented Oct 31, 2024

It's not entirely like that. OpenCL on Windows returns one device with just half of the CUs, as stated above.
But regardless of that, RDNA2 cards are currently partially mis-supported (the work is split in the wrong fashion).

Splitting the work in two while keeping the rest of the code as-is yields even worse results.
