Fixed The C Loop So you don't have to. #285
loops/c/code.c
Outdated
int main (int argc, char** argv) {
  int u = atoi(argv[1]);               // Get an input number from the command line
  srand(time(NULL));                   // FIX random seed
  int r = rand() % 10000;              // Get a random integer 0 <= r < 10k
  int32_t a[10000] = {0};              // Array of 10k elements initialized to 0
  for (int i = 0; i < 10000; i++) {    // 10k outer loop iterations
    for (int j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
      a[i] = a[i] + j%u;               // Simple sum
      a[i] = a[i] + REM(j,u)           // Simple sum
Cool, but why doesn't the compiler do this?
I don't know, to be honest. The only time I have seen a compiler optimize a modulo is when the divisor is a power of 2; in that case it replaces the division with an "and" instruction. For example, if
a = 35
b = 32
then
a % b = a & (b - 1)
which in this case is 3.
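To make the power-of-2 case concrete, here is a minimal sketch of the mask trick. The helper names `is_power_of_two` and `mod_pow2` are made up for illustration and are not part of this PR. The identity is only assumed to hold for unsigned (or non-negative) values, which is why the sketch uses `uint32_t`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the PR): true when b is a nonzero power of two. */
static int is_power_of_two(uint32_t b) {
    return b != 0 && (b & (b - 1)) == 0;
}

/* Hypothetical helper (not from the PR): remainder by a power-of-two divisor,
   computed with a bit mask instead of a division. Unsigned values only. */
static uint32_t mod_pow2(uint32_t a, uint32_t b) {
    assert(is_power_of_two(b)); /* the mask trick is invalid for other divisors */
    return a & (b - 1);
}
```

With the values from the comment above, `mod_pow2(35, 32)` yields 3, the same as `35 % 32`. For a divisor that is not a power of 2, such as the runtime value `u` in the loop, the mask does not apply.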
Thanks for contributing! 🙏
Can you explain for someone not familiar with Casey? What are the trade-offs? (I am assuming there are trade-offs since the compiler doesn't do this optimization.) And please also provide benchmarks showing what effects the change has to motivate the use of a macro.
@PEZ @He-Pin A full explanation can be found in this video (https://www.youtube.com/watch?v=RrHGX1wwSYM). The original code can't be vectorized: it uses the idiv instruction (on x86 machines, for example), which is very slow. This version also uses division, but the compiler can vectorize it using SIMD (Single Instruction, Multiple Data): instead of working on one piece of data at a time, one instruction operates on multiple pieces of data. Why the compiler doesn't do that already is beyond my knowledge, to be honest.
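The definition of `REM` is not shown in this thread, so the following is only a sketch of one way such a macro could be written, not the PR's actual code. It assumes the remainder is computed through a floating-point division, which SIMD units do provide, and that both operands are 32-bit with `a >= 0` and `b > 0`. The names `REM_SKETCH` and `sum_remainders` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the PR's REM macro (the real definition is not shown).
   Computes a % b as a - b * trunc(a / b), doing the division in double precision.
   Assumes 32-bit operands with a >= 0 and b > 0; in that range the truncated
   double quotient matches the integer quotient. */
#define REM_SKETCH(a, b) ((a) - (b) * (int32_t)((double)(a) / (double)(b)))

/* Hypothetical helper mirroring the inner loop: sums j % u for 0 <= j < n.
   Assumes the total fits in int32_t. */
static int32_t sum_remainders(int32_t n, int32_t u) {
    assert(n >= 0 && u > 0);
    int32_t sum = 0;
    for (int32_t j = 0; j < n; j++) {
        sum += REM_SKETCH(j, u); /* no integer divide in the loop body */
    }
    return sum;
}
```

Whether a given compiler actually vectorizes this loop depends on the compiler, the optimization flags, and the target instruction set, so any speedup would need to be confirmed with benchmarks.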
Without enabling a more advanced instruction set such as AVX2 or AVX-512, this change might actually slow things down. Additionally, the same change could be made for many other languages as well.
Added a Casey-inspired macro.
Thanks,