Fixed The C Loop So you don't have to. #285

Open
wants to merge 3 commits into
base: main
from


JustForCollege

Added a Casey-inspired macro.

Thanks,

loops/c/code.c Outdated
 int main (int argc, char** argv) {
   int u = atoi(argv[1]);               // Get an input number from the command line
   srand(time(NULL));                   // FIX random seed
   int r = rand() % 10000;              // Get a random integer 0 <= r < 10k
   int32_t a[10000] = {0};              // Array of 10k elements initialized to 0
   for (int i = 0; i < 10000; i++) {    // 10k outer loop iterations
     for (int j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
-      a[i] = a[i] + j%u;               // Simple sum
+      a[i] = a[i] + REM(j,u);          // Simple sum
Contributor


Cool, but why doesn't the compiler do this itself?

Author


I don't know, to be honest. The only case where I've seen a compiler optimize modulo is when the divisor is a power of 2: it then replaces the modulo with an AND instruction. For example, if

a = 35
b = 32

a % b = a & (b - 1) = 35 & 31

which in this case is 3 (the identity holds for non-negative a when b is a power of 2).

Collaborator

PEZ left a comment


Thanks for contributing! 🙏

Can you explain this for someone not familiar with Casey? What are the trade-offs? (I am assuming there are some, since the compiler doesn't do this optimization.) Please also provide benchmarks showing the effect of the change, to motivate the use of a macro.

@JustForCollege
Author

JustForCollege commented Dec 22, 2024

@PEZ @He-Pin A full explanation can be found in this video (https://www.youtube.com/watch?v=RrHGX1wwSYM). The original code can't be vectorized: it uses the idiv instruction (on x86 machines, for example), which is very slow. This version also uses division, but in a form the compiler can vectorize with SIMD (Single Instruction, Multiple Data), where one instruction operates on several pieces of data at once instead of just one. Why the compiler doesn't already do that is beyond me, to be honest.

Ichoran mentioned this pull request on Dec 22, 2024
@alberts8

Without enabling a more advanced instruction set such as AVX2 or AVX-512, this change might actually slow things down.

The same change could also be made for many of the other languages.

4 participants