-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: arm64 performance optimizations #4288
Conversation
I couldn't properly add |
Hi @rami-lv, thank you for the contribution! Regarding your question on adding an external dependency, it might be much easier for VW to consume a library if it's already added in vcpkg. Would you mind adding a port to vcpkg and then use it in VW? Just curious, sse2neon seems to support up to SSE4, while SIMDe supports a wider range of instruction sets including AVX2 (and partially AVX512). Would it be possible to use SIMDe for your purposes? And I think it is already ported in vcpkg. |
@zwd-ms Thanks for the suggestion I didn't know about the project. In my opinion, unless the maintainers decide to integrate SIMDe this should be temporary, and the long-term solution is to actually implement the corresponding NEON logic. As for |
Never mind I just found them. |
cmake/VWFlags.cmake
Outdated
set(LINUX_ARM64_OPT_FLAGS "") | ||
if("${CMAKE_SYSTEM_PROCESSOR}" MATCHES "aarch64|arm64|ARM64") | ||
set(LINUX_ARM64_OPT_FLAGS -mcpu=neoverse-n1) | ||
endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is specific to this CPU, so we should not be adding it whenever we encounter an arm CPU. Generally for specific optimization flags like this it is better to add them when configuring your own build.
So, in this case we should remove this but in your build you would add the following to your command line to achieve the same outcome:
-DCMAKE_CXX_FLAGS_RELEASE="-mcpu=neoverse-n1"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to go ahead and push a change removing this piece so we can move forward here.
@@ -32,6 +32,10 @@ VW_WARNING_STATE_POP | |||
#include "vw/core/vw_versions.h" | |||
#include "vw/io/logger.h" | |||
|
|||
#if defined(__ARM_NEON) | |||
#include <sse2neon/sse2neon.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has not been made available to the vw_core
cmake target, so the build won't be able to find this. We also don't have any CI which is going to cause this path to be exercised so we need to be careful.
The include should be added here:
target_include_directories(vw_core PRIVATE ${CMAKE_CURRENT_LIST_DIR}/src) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just created a package in vcpkg
for sse2neon microsoft/vcpkg#28129
Would it be okay if I add the header directly in the source code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is still preferable to use the submodule. I can push a commit to the branch with this change,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can push a commit to the branch with this change
That would be nice of you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackgerrits FYI, I added sse2neon
to vcpkg. Would you prefer that I add the package instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding that, that's super helpful! There was a tiny issue with the installed path, I went ahead and opened a fix here microsoft/vcpkg#28268. When that gets merged I will update this PR to consume.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll go ahead and merge this one and we can update to consume vcpkg later.
Recently at @liftoff I have been leveraging arm64 machines to train our models using vowpal wabbit.
Using arm64 in our infrastructure has helped us reduce the cost of our ML infrastructure.
To get the most out of vowpal wabbit on these machines, we added arm64 compiler optimization and transported the SIMD instructions used in vowpal wabbit via sse2neon. To do that, we followed AWS guide on how to optimize builds for arm64 machines, there are probably more optimizations to apply which require a deeper knowledge of the training algorithm.
These optimizations improved our ML pipeline time by roughly 20% (the pipeline contains steps other than training with vowpal wabbit so more experiments should be conducted if you are interested in getting the improvement on vowpal wabbit).
The proper solution would be to fill the placeholder in
lda_core.cc
with the corresponding instructions but that would require a deeper understanding of the code.