perf: arm64 performance optimizations #4288

rami-lv · 2022-11-24T16:27:28Z

Recently at @liftoff I have been leveraging arm64 machines to train our models using vowpal wabbit.
Using arm64 in our infrastructure has helped us reduce the cost of our ML infrastructure.

To get the most out of vowpal wabbit on these machines, we added arm64 compiler optimization flags and translated the SSE SIMD intrinsics used in vowpal wabbit to NEON via sse2neon. To do that, we followed the AWS guide on optimizing builds for arm64 machines. There are probably more optimizations to apply, but they would require deeper knowledge of the training algorithm.

These optimizations reduced our ML pipeline time by roughly 20%. The pipeline contains steps other than training with vowpal wabbit, so more experiments are needed to measure the improvement in vowpal wabbit itself.

The proper solution would be to fill the placeholder in lda_core.cc with the corresponding NEON instructions, but that would require a deeper understanding of the code.
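As a rough illustration of the translation approach described above, here is a minimal sketch of SSE-style code that compiles on both x86 and arm64. It is not the code from lda_core.cc: the function `add_floats` and the macro `EXAMPLE_HAS_SSE_API` are hypothetical names, and the sketch assumes sse2neon is reachable as `<sse2neon/sse2neon.h>`, falling back to a scalar loop when it is not.

```cpp
#include <cassert>
#include <cstddef>

// Pick an SSE-style API: sse2neon on ARM when its header is on the include
// path, the native header on x86, otherwise none (scalar fallback only).
#if defined(__ARM_NEON) && defined(__has_include)
#  if __has_include(<sse2neon/sse2neon.h>)
#    include <sse2neon/sse2neon.h>
#    define EXAMPLE_HAS_SSE_API 1  // hypothetical macro, for this sketch only
#  endif
#elif defined(__SSE2__)
#  include <emmintrin.h>
#  define EXAMPLE_HAS_SSE_API 1
#endif

// Hypothetical kernel: out[i] = a[i] + b[i], four floats at a time when an
// SSE-style API is available. The same intrinsic names are used on x86 and,
// through sse2neon, on ARM.
void add_floats(const float* a, const float* b, float* out, std::size_t n)
{
  assert(a != nullptr && b != nullptr && out != nullptr);
  std::size_t i = 0;
#if defined(EXAMPLE_HAS_SSE_API)
  for (; i + 4 <= n; i += 4)
  {
    const __m128 va = _mm_loadu_ps(a + i);
    const __m128 vb = _mm_loadu_ps(b + i);
    _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
  }
#endif
  // Scalar tail (or the whole loop when no SIMD API was found).
  for (; i < n; ++i) { out[i] = a[i] + b[i]; }
}
```

The point of the pattern is that the body of the kernel does not change between platforms; only the header that provides the intrinsics does.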

rami-lv · 2022-11-24T16:28:07Z

I couldn't properly add sse2neon as an external dependency; I would appreciate some help with that.

zwd-ms · 2022-11-29T23:09:35Z

Hi @rami-lv, thank you for the contribution!

Regarding your question on adding an external dependency, it might be much easier for VW to consume a library if it's already available in vcpkg. Would you mind adding a port to vcpkg and then using it in VW?

Just curious, sse2neon seems to support up to SSE4, while SIMDe supports a wider range of instruction sets including AVX2 (and partially AVX512). Would it be possible to use SIMDe for your purposes? And I think it is already ported in vcpkg.

rami-lv · 2022-12-02T13:28:19Z

@zwd-ms Thanks for the suggestion, I didn't know about the project.
However, integrating SIMDe means we would have to rename the intrinsic calls in lda_core.cc to match SIMDe's, which the maintainers might not prefer.
Also, SIMDe has merged all of the sse2neon code, so there is no advantage to using one over the other.

In my opinion, unless the maintainers decide to integrate SIMDe this should be temporary, and the long-term solution is to actually implement the corresponding NEON logic.
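To make the "implement the corresponding NEON logic" option concrete, here is a minimal sketch of a kernel with a hand-written NEON path next to an SSE path. It is an illustration only and does not reproduce anything in lda_core.cc; the function name `scale_and_add` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_NEON)
#  include <arm_neon.h>
#elif defined(__SSE2__)
#  include <emmintrin.h>
#endif

// Hypothetical kernel: y[i] += alpha * x[i].
// Each platform gets its own intrinsics instead of a translation layer.
void scale_and_add(float alpha, const float* x, float* y, std::size_t n)
{
  assert(x != nullptr && y != nullptr);
  std::size_t i = 0;
#if defined(__ARM_NEON)
  const float32x4_t valpha = vdupq_n_f32(alpha);
  for (; i + 4 <= n; i += 4)
  {
    const float32x4_t vx = vld1q_f32(x + i);
    const float32x4_t vy = vld1q_f32(y + i);
    vst1q_f32(y + i, vaddq_f32(vy, vmulq_f32(valpha, vx)));
  }
#elif defined(__SSE2__)
  const __m128 valpha = _mm_set1_ps(alpha);
  for (; i + 4 <= n; i += 4)
  {
    const __m128 vx = _mm_loadu_ps(x + i);
    const __m128 vy = _mm_loadu_ps(y + i);
    _mm_storeu_ps(y + i, _mm_add_ps(vy, _mm_mul_ps(valpha, vx)));
  }
#endif
  // Scalar tail, and the full loop on targets with neither instruction set.
  for (; i < n; ++i) { y[i] += alpha * x[i]; }
}
```

The cost of this approach is that every SIMD kernel has to be written and maintained twice, which is the trade-off against a translation header such as sse2neon or SIMDe.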

As for AVX2 I couldn't find any usage of AVX intrinsics through the code, did you?

rami-lv · 2022-12-02T13:55:33Z

As for AVX2 I couldn't find any usage of AVX intrinsics through the code, did you?

Never mind I just found them.

jackgerrits · 2022-12-02T15:23:15Z

cmake/VWFlags.cmake

+set(LINUX_ARM64_OPT_FLAGS "")
+if("${CMAKE_SYSTEM_PROCESSOR}" MATCHES "aarch64|arm64|ARM64")
+  set(LINUX_ARM64_OPT_FLAGS -mcpu=neoverse-n1)
+endif()


This is specific to this CPU, so we should not be adding it whenever we encounter an arm CPU. Generally for specific optimization flags like this it is better to add them when configuring your own build.

So, in this case we should remove this but in your build you would add the following to your command line to achieve the same outcome:

-DCMAKE_CXX_FLAGS_RELEASE="-mcpu=neoverse-n1"

I'm going to go ahead and push a change removing this piece so we can move forward here.
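When target flags are passed on the command line like this, it can help to confirm what the compiler actually enabled. The following is a minimal sketch that reports a few predefined target macros; the function name `enabled_simd_targets` is hypothetical, and the list of macros checked is a small sample rather than a complete inventory.

```cpp
#include <string>

// Hypothetical helper: lists which of a few SIMD-related predefined macros
// the compiler set for this translation unit.
std::string enabled_simd_targets()
{
  std::string targets;
#if defined(__aarch64__)
  targets += "aarch64 ";
#endif
#if defined(__ARM_NEON)
  targets += "neon ";
#endif
#if defined(__SSE2__)
  targets += "sse2 ";
#endif
#if defined(__AVX2__)
  targets += "avx2 ";
#endif
  // None of the sampled macros were defined by the compiler.
  if (targets.empty()) { targets = "scalar"; }
  return targets;
}
```

Printing the result from a small test binary built with the same flags as the release build gives a quick sanity check that the intended code path is being compiled.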

jackgerrits · 2022-12-02T15:31:42Z

vowpalwabbit/core/src/reductions/lda_core.cc

@@ -32,6 +32,10 @@ VW_WARNING_STATE_POP
 #include "vw/core/vw_versions.h"
 #include "vw/io/logger.h"

+#if defined(__ARM_NEON)
+#include <sse2neon/sse2neon.h>


This has not been made available to the vw_core cmake target, so the build won't be able to find this. We also don't have any CI which is going to cause this path to be exercised so we need to be careful.

The include should be added here:

vowpal_wabbit/vowpalwabbit/core/CMakeLists.txt

Line 372 in 3ee9665

target_include_directories(vw_core PRIVATE ${CMAKE_CURRENT_LIST_DIR}/src)

I just created a package in vcpkg for sse2neon microsoft/vcpkg#28129

Would it be okay if I add the header directly in the source code?

I think it is still preferable to use the submodule. I can push a commit to the branch with this change.

I can push a commit to the branch with this change

That would be nice of you

@jackgerrits FYI, I added sse2neon to vcpkg. Would you prefer that I add the package instead?

Thanks for adding that, that's super helpful! There was a tiny issue with the installed path, so I went ahead and opened a fix here: microsoft/vcpkg#28268. When that gets merged I will update this PR to consume it.

I'll go ahead and merge this one and we can update to consume vcpkg later.

rami-lv added 2 commits November 24, 2022 16:03

Add arm64 optmizations flags

8ec5ffa

Transport SSE intrinsics through sse2neon on ARM

ec46d50

rami-lv marked this pull request as ready for review November 24, 2022 16:28

rami-lv changed the title ~~Arm64 performance optimizations~~ Add arm64 performance optimizations Nov 24, 2022

rami-lv changed the title ~~Add arm64 performance optimizations~~ [perf ] arm64 performance optimizations Nov 24, 2022

rami-lv changed the title ~~[perf ] arm64 performance optimizations~~ perf: arm64 performance optimizations Nov 24, 2022

jackgerrits requested changes Dec 2, 2022

jackgerrits and others added 3 commits December 2, 2022 13:13

expose sse2neon to cmake

8150684

Update VWFlags.cmake

ee6a431

Merge branch 'master' into master

98d3d44

zwd-ms approved these changes Dec 12, 2022

jackgerrits self-requested a review December 12, 2022 22:41

jackgerrits approved these changes Dec 12, 2022

zwd-ms added 3 commits December 12, 2022 17:46

Update lda_core.cc

941c896

fix format

Update lda_core.cc

01c560c

Update lda_core.cc

ba88bf3

zwd-ms enabled auto-merge (squash) December 12, 2022 22:57

zwd-ms merged commit ca0c23c into VowpalWabbit:master Dec 12, 2022
