Iterative solver on OpenCL (GPU) devices #199
r1349 - 3466ed7
This has been using clBLAS for some time already (see #204).
Another application of clBLAS is to compute the inner products inside
Another issue with clBLAS, or more generally with executing the whole iteration on the GPU: the only natural synchronization point is when the residual (or some other scalar coefficient) is updated. Since kernel launches are asynchronous, a host-side timer around the matrix-vector product only measures the time needed to enqueue it, while its actual execution time is absorbed into whatever operation next waits for the device. Therefore, the timing of the matrix-vector product becomes completely inadequate. The only ways to fix this are either to measure the timing inside kernels (but I am not sure that is possible) or to add some ad hoc synchronization points. The latter may affect performance, though probably not significantly (still, this can be tested). There have been similar considerations for the MPI timing, but I could not find any discussion in the issues (maybe there is some in the source code).
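To illustrate why the matrix-vector timing breaks down, here is a toy model in Python (not OpenCL, and not ADDA's code). Enqueueing a kernel returns immediately, and the host only waits at a blocking synchronization point. The class `SimulatedQueue` and the function `iteration_timings` are hypothetical. Costs are in arbitrary time units, and host-side enqueue overhead is assumed to be zero. In real OpenCL code, the ad hoc synchronization point would correspond to a blocking call such as `clFinish` on the command queue.

```python
from dataclasses import dataclass


@dataclass
class SimulatedQueue:
    """Hypothetical toy model of an in-order, asynchronous device queue (not OpenCL)."""
    host_time: float = 0.0          # host wall-clock time
    device_busy_until: float = 0.0  # time at which all enqueued work completes

    def enqueue(self, cost: float) -> None:
        # Non-blocking: the host clock does not advance (zero enqueue overhead assumed).
        # The kernel starts once the device has finished previously enqueued work.
        start = max(self.host_time, self.device_busy_until)
        self.device_busy_until = start + cost

    def finish(self) -> None:
        # Blocking synchronization: the host waits until all enqueued work is done.
        self.host_time = max(self.host_time, self.device_busy_until)


def iteration_timings(q: SimulatedQueue, matvec_cost: float, dot_cost: float,
                      sync_after_matvec: bool) -> tuple[float, float]:
    """Host-measured times of (matrix-vector product, scalar/residual update)."""
    t0 = q.host_time
    q.enqueue(matvec_cost)
    if sync_after_matvec:
        q.finish()  # the ad hoc synchronization point
    t1 = q.host_time
    q.enqueue(dot_cost)
    q.finish()  # reading back the scalar residual forces synchronization anyway
    t2 = q.host_time
    return t1 - t0, t2 - t1


print(iteration_timings(SimulatedQueue(), 5.0, 1.0, sync_after_matvec=False))
print(iteration_timings(SimulatedQueue(), 5.0, 1.0, sync_after_matvec=True))
```

Without the extra synchronization, the model attributes no time to the matrix-vector product and all 6 units to the scalar update. With it, the split becomes 5 and 1, while the total stays the same. In this model the extra sync is free because the device would be idle afterwards anyway. On a real device it may also give up some overlap between host-side enqueueing and kernel execution, which is where the possible performance cost comes from.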
This actually applies to many OpenCL issues, but here are tests of the current
As a side note, we have never seriously considered CUDA, so as not to be limited to Nvidia GPUs. However, the CUDA FFT routines turned out to be about 1.5 times faster than clFFT (in a limited number of tests). Still, I guess that a systematic comparison of the two should already have been performed by others.
Original issue reported on code.google.com by [email protected] on 31 May 2014 at 3:36