Feature/raja vec #83

Open
wants to merge 15 commits into develop

Conversation

vsrana01

Added vectorization in stream/add

vsrana01 requested a review from rhornung67 May 21, 2020 19:56
@@ -53,26 +53,25 @@ void POLYBENCH_2MM::runOpenMPVariant(VariantID vid)

   POLYBENCH_2MM_VIEWS_RAJA;

-  auto poly_2mm_lam1 = [=](Index_type /*i*/, Index_type /*j*/, Index_type /*k*/, Real_type &dot) {
+  auto poly_2mm_lam1 = [=](Real_type &dot) {
Member

@vsrana01 have you verified that these changes do not adversely affect performance across a range of compilers?

I don't think we want to make changes like this in this PR, since they are orthogonal. We can think about adding additional variants like this later to further stress compilers.

@rhornung67 These changes may be needed with the latest Lambda changes: if not all segments are in active loops, the Lambda form will fail with a static_assert.
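
For reference, a minimal sketch of the newer lambda-argument specification being discussed, in which the kernel policy names the Segs/Params each lambda receives, so a lambda that runs where not every segment is active can drop the unused index arguments. This is only an illustration of the form; the policies and lambdas actually used in the suite may differ.

// Hypothetical RAJA::kernel policy: lambda 0 initializes the dot product and is
// invoked outside the innermost (k) loop, so it receives only the parameter and
// no segment indices.
using EXEC_POL =
  RAJA::KernelPolicy<
    RAJA::statement::For<0, RAJA::loop_exec,            // i
      RAJA::statement::For<1, RAJA::loop_exec,          // j
        RAJA::statement::Lambda<0, RAJA::Params<0>>,    // dot = 0
        RAJA::statement::For<2, RAJA::loop_exec,        // k
          RAJA::statement::Lambda<1, RAJA::Segs<0,1,2>, RAJA::Params<0>>
        >,
        RAJA::statement::Lambda<2, RAJA::Segs<0,1>, RAJA::Params<0>>
      >
    >
  >;

// With the arguments named in the policy, the init lambda can omit the unused indices:
auto poly_2mm_lam1 = [=](Real_type &dot) { dot = 0.0; };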

@@ -43,27 +43,24 @@ void POLYBENCH_2MM::runSeqVariant(VariantID vid)

   POLYBENCH_2MM_VIEWS_RAJA;

-  auto poly_2mm_lam1 = [=](Index_type /*i*/, Index_type /*j*/, Index_type /*k*/, Real_type &dot) {
+  auto poly_2mm_lam1 = [=](Real_type &dot) {
Member

Same comment as previous one.

Member
@rhornung67 left a comment

@vsrana01 the addition of vector variants for the ADD and DAXPY kernels looks good.

It's good that you are looking at eliminating the unused lambda expression arguments with the newer RAJA 'Segs' mechanisms. But that has to be assessed carefully for performance with as many compilers as you can try. If it all looks good, that work can come in as a separate PR.

@@ -21,6 +21,7 @@ if (PERFSUITE_ENABLE_WARNINGS)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -Werror")
endif()

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mavx2")

This flag should go in the scripts/lc-builds/XXX files. Also, I think there is an architecture-agnostic flag (for at least gnu and clang) like -march=native; if the machine has SSE, AVX, AVX2, or AVX512, it will pick the best one.
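
As a sketch of what that could look like in an lc-builds host-config (the file placement is hypothetical; -march=native applies to gnu and clang, other compilers use different spellings):

# In a scripts/lc-builds/ host-config (hypothetical location), instead of hard-coding -mavx2:
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")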

Author

Yes, I saw the lc-builds file and that is where I have it now. Not sure how this slipped in, but thanks for the feedback!

@ajkunen

ajkunen commented May 21, 2020

Perhaps we should take a step back and do a PR for the Lambda changes first? What do you think @vsrana01 ?

@vsrana01
Author

I agree with:

> Perhaps we should take a step back and do a PR for the Lambda changes first? What do you think @vsrana01 ?

vsrana01 closed this May 21, 2020
vsrana01 reopened this May 21, 2020
@vsrana01
Author

@rhornung67 those changes were needed. @ajkunen I agree with you; I can do one PR for just the Lambda changes and then a second one for the vectorization work. I can test across the different compilers with the new lambda changes and see if there is a performance difference.

@rhornung67
Member

@vsrana01 and @ajkunen if the lambda args changes are needed for the vectorization stuff, then it may be a good idea to figure out a good way to have both variants (with and without the 'Segs' business) for the non-vector variants. My main concern is that we want to make sure both versions of each kernel perform the same for each compiler. If not, then this is a good place for vendors to mine for why they are not. But, let's not do that now.

I suggest only making additions you need to support the vector variants and leave all non-vector variants as is for now. Does that make sense?

@ajkunen

ajkunen commented May 21, 2020

@rhornung67 I think the current RAJA develop branch imposes the Lambda requirements, which means the use of the new Lambda notation is necessary. I think if @vsrana01 does a PR for RAJAPerf with just the Lambda changes (and updated RAJA), we can see if there is a performance difference there. After that's complete, the vector_exec work will be more narrowly scoped, and we can test that performance separately.

But we can't do the vector_exec stuff now without the Lambda stuff.

@vsrana01
Author

@ajkunen and @rhornung67 when I build with Adam's vectorization branch of RAJA, I get compiler errors due to the new Lambda requirements. I will start looking at how kernel performance changes with the new lambda requirements and create a new PR.

@rhornung67
Member

@vsrana01 and @ajkunen OK. I misunderstood the constraints.

I think it would be best to do a PR with a new variant added (RAJA_Seq_Args) so we can assess performance. Then, move on from there. Agree?

@vsrana01
Author

vsrana01 commented May 22, 2020

@rhornung67 @ajkunen agreed.

#define DAXPY_DATA_VEC_SETUP3 \
RAJA_INDEX_VALUE_T(I, Int_type, "I");\
using element_t = RAJA::StreamVector<Real_type,2>::element_type; \
element_t X[iend], Y[iend]; \

This really isn't the intended use. You are basically telling the compiler to hold the entire X and Y arrays in registers at the same time.
You want to keep the arrays as "Real_type X[iend], Y[iend]" and load vector-sized chunks of those arrays using element_t. You can do the loads/stores either with Views plus VectorIndex, or by using the load() and store() functions in the vector or register classes.

Yview(i) += a*Xview(i);

#define DAXPY_VEC_BODY3 \
for(int i = 0;i < iend; ++i){ \

Going along with the last comment: the "i" index here should range over Real_types, so it should increment by the vector width.
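
Putting the two comments together, a rough sketch of the suggested pattern (assuming, per the comment above, that the vector/register class provides load() and store() taking a pointer and supports the usual arithmetic operators; exact names and signatures on the experimental branch may differ):

// Keep the full arrays as plain Real_type in memory; only a vector-width chunk
// lives in the vector type at a time.
using vec_t = RAJA::StreamVector<Real_type, 2>;
constexpr Index_type vec_width = 2;   // matches StreamVector<Real_type, 2>

void daxpy_vec(Real_type* x, Real_type* y, Real_type a, Index_type iend)
{
  // The "i" index ranges over Real_types and strides by the vector width
  // (a remainder loop for iend not divisible by vec_width is omitted).
  for (Index_type i = 0; i + vec_width <= iend; i += vec_width) {
    vec_t xv, yv;
    xv.load(&x[i]);    // load a vector-sized chunk (assumed signature)
    yv.load(&y[i]);
    yv += xv * a;      // daxpy on the chunk
    yv.store(&y[i]);   // write the chunk back
  }
}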
