Feature/raja vec #83
base: develop
Conversation
src/polybench/POLYBENCH_2MM-OMP.cpp
Outdated
@@ -53,26 +53,25 @@ void POLYBENCH_2MM::runOpenMPVariant(VariantID vid)

   POLYBENCH_2MM_VIEWS_RAJA;

-  auto poly_2mm_lam1 = [=](Index_type /*i*/, Index_type /*j*/, Index_type /*k*/, Real_type &dot) {
+  auto poly_2mm_lam1 = [=](Real_type &dot) {
@vsrana01 have you checked that these changes do not adversely affect performance across a range of compilers?
I don't think we want to make changes like this in this PR, since they are orthogonal. We can think about adding additional variants like this later to further stress compilers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhornung67 These changes may be needed with the latest Lambda changes: if not all segments are used in active loops, the Lambda form fails with a static_assert.
src/polybench/POLYBENCH_2MM-Seq.cpp
Outdated
@@ -43,27 +43,24 @@ void POLYBENCH_2MM::runSeqVariant(VariantID vid)

   POLYBENCH_2MM_VIEWS_RAJA;

-  auto poly_2mm_lam1 = [=](Index_type /*i*/, Index_type /*j*/, Index_type /*k*/, Real_type &dot) {
+  auto poly_2mm_lam1 = [=](Real_type &dot) {
Same comment as previous one.
@vsrana01 the addition of vector variants for the ADD and DAXPY kernels looks good.
It's good that you are looking at eliminating the unused lambda expression arguments with the newer RAJA 'Segs', etc. constructs. But that has to be assessed carefully for performance with as many compilers as you can try. If it all looks good, that work can come in as a separate PR.
@@ -21,6 +21,7 @@ if (PERFSUITE_ENABLE_WARNINGS)
   set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -Werror")
 endif()

+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mavx2")
This flag should go in the scripts/lc-builds/XXX files. Also, I think there is an architecture-agnostic flag (for at least gnu and clang) like -march=native or something: if the machine has SSE, AVX, AVX2, or AVX512, it will pick the best one.
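A hedged sketch of what that might look like in a per-machine build script (the file placement and comments are illustrative, not the actual RAJAPerf host-config contents):

```cmake
# Illustrative host-config placement (not the actual RAJAPerf file):
# keeping the SIMD flag in a per-machine script under scripts/lc-builds/
# leaves the top-level CMakeLists.txt architecture-neutral.
# -march=native (GCC/Clang) selects the best instruction set available
# on the build host (SSE, AVX, AVX2, or AVX-512).
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
```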
Yes, I saw the lc-builds file and that is where I have it now. Not sure how this slipped in, but thanks for the feedback!
Perhaps we should take a step back and do a PR for the Lambda changes first? What do you think @vsrana01 ?
@rhornung67 those changes were needed. @ajkunen I agree with you; I can do another PR for just the Lambda changes and then a second one for the vec stuff. I can test across the different compilers with the new lambda changes and see if there is a performance difference.
@vsrana01 and @ajkunen if the lambda args changes are needed for the vectorization stuff, then it may be a good idea to figure out a good way to have both variants (with and without the 'Segs' business) for the non-vector variants. My main concern is that we want to make sure both versions of each kernel perform the same for each compiler. If not, then this is a good place for vendors to mine for why they are not. But let's not do that now. I suggest only making additions you need to support the vector variants and leaving all non-vector variants as is for now. Does that make sense?
@rhornung67 I think the current RAJA develop branch imposes the Lambda requirements, which means the use of the new Lambda notation is necessary. I think if @vsrana01 does a PR for RAJAPerf with just the Lambda changes (and updated RAJA) we can see if there is a performance difference there. After that's complete, the vector_exec work will be more narrowly scoped, and we can test that performance separately. But we can't do the vector_exec stuff now without the Lambda stuff.
@ajkunen and @rhornung67 when I build with Adam's vectorization branch of RAJA, I get compiler errors due to the new Lambda requirements. I will start looking at how the kernels' performance changes under the new lambda requirements and create a new PR.
@rhornung67 @ajkunen agreed.
#define DAXPY_DATA_VEC_SETUP3 \
  RAJA_INDEX_VALUE_T(I, Int_type, "I"); \
  using element_t = RAJA::StreamVector<Real_type,2>::element_type; \
  element_t X[iend], Y[iend];
This really isn't the intended use. You are basically telling the compiler to hold the entire X and Y arrays in registers at the same time.
You want to keep the arrays as "Real_type X[iend], Y[iend]", and load vector-sized chunks of those arrays using element_t. You can do the loads/stores either with Views + VectorIndex types, or by using the load() and store() functions in the vector or register classes.
  Yview(i) += a * Xview(i);

#define DAXPY_VEC_BODY3 \
  for (int i = 0; i < iend; ++i) { \
Going along with the last comment: the "i" index here iterates over Real_types, so it should increment by the vector width.
updating polybench
Added vectorization in stream/add