low performance in linux (v0.8.5 Build 9543) #49

Open
edisonchan opened this issue Aug 1, 2024 · 7 comments

@edisonchan

Ubuntu 24.04
14900K SMT on:
image

windows 11
image

@Mysticial
Owner

Linux is normally a few percent slower. But bigger differences like this usually indicate a p-state problem. On Intel CPUs, Linux often fails to keep the cores clocked at full speed without special drivers, but it's hard to say what the issue is here.

It could be as simple as Linux not being as good as Windows at managing P-core and E-core scheduling.
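One way to check the p-state theory is to look at what the kernel's cpufreq interface reports while the benchmark runs. The following is a minimal sketch, assuming the standard sysfs cpufreq files are present on the system (they may be missing in VMs or containers); `read_first_line` and `print_cpufreq_state` are hypothetical helper names, not part of y-cruncher.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads the first line of a small text file into buf, with the trailing
 * newline stripped. Returns 0 on success, -1 if the file cannot be
 * opened or is empty. */
int read_first_line(const char *path, char *buf, size_t len) {
    assert(buf != NULL && len > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL) return -1;
    char *got = fgets(buf, (int)len, f);
    fclose(f);
    if (got == NULL) return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Prints the cpufreq driver, governor, and current frequency (in kHz)
 * for one logical CPU, or "unavailable" if a file cannot be read. */
void print_cpufreq_state(int cpu) {
    static const char *files[] = {
        "scaling_driver", "scaling_governor", "scaling_cur_freq"
    };
    char path[128], value[64];
    for (size_t i = 0; i < sizeof files / sizeof files[0]; i++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/%s", cpu, files[i]);
        if (read_first_line(path, value, sizeof value) == 0)
            printf("cpu%d %s: %s\n", cpu, files[i], value);
        else
            printf("cpu%d %s: unavailable\n", cpu, files[i]);
    }
}
```

Calling `print_cpufreq_state(0)` during a run shows whether a power-saving governor is active and whether the reported frequency is well below the expected boost clock.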

@edisonchan
Author

14900K SMT off, ubuntu 24.04:

v0.7.10
image

v0.8.5
image

@Mysticial
Owner

That's not too surprising. Alder/Raptor Lake is running the 14-BDW binary which is optimized for Broadwell and Skylake Client. So optimizations for those processors may not be beneficial on Alder/Raptor Lake.

Hybrid cores are especially bad for this kind of workload.

@Mysticial added the bug label Sep 1, 2024
@Mysticial
Owner

Marking this as a bug. But unless someone donates me the relevant hardware, there is nothing I can do. But again, hybrid cores are kinda impossible to optimize for.

@edisonchan
Author

edisonchan commented Sep 2, 2024

> Marking this as a bug. But unless someone donates me the relevant hardware, there is nothing I can do. But again, hybrid cores are kinda impossible to optimize for.

Are you using GOMP_CPU_AFFINITY to set the core allocation priority? AMD and Intel behave in opposite ways here: Zen4 requires GOMP_CPU_AFFINITY, while Intel's hybrid architecture cannot use it (otherwise threads are prioritized onto the E-cores).

Alder Lake has a CPUID leaf specifically for identifying P-Core and E-Core. I tested it on Raptor Cove and it is the same:
https://www.intel.com/content/www/us/en/developer/articles/guide/12th-gen-intel-core-processor-gamedev-guide.html

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

/*
 * This program detects whether the CPU is a hybrid design and identifies the core types (P-core or E-core).
 * It uses the CPUID instruction to gather information about the CPU.
 * 
 * CPUID leaf 7, EDX bit 15 (isHybrid_bit) indicates a hybrid design.
 * CPUID leaf 1A, EAX bits 24-31 indicate the core type:
 *   0x20: isAtom (E-core)
 *   0x40: isCore (P-core)
 * 
 * Sources:
 * - https://stackoverflow.com/questions/69955410/how-to-detect-p-e-core-in-intel-alder-lake-cpu
 * - https://www.intel.com/content/www/us/en/developer/articles/guide/12th-gen-intel-core-processor-gamedev-guide.html
 */

// Function to execute the CPUID instruction
void cpuid(uint32_t leaf, uint32_t subleaf, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) {
    __asm__ __volatile__ (
        "cpuid"
        : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
        : "a" (leaf), "c" (subleaf)
    );
}

int main(int argc, char *argv[]) {
    uint32_t eax, ebx, ecx, edx;
    int cpu = -1;
    cpu_set_t cpuset;
    int num_cpus = sysconf(_SC_NPROCESSORS_ONLN);
    int pcore_start = -1, pcore_end = -1, ecore_start = -1, ecore_end = -1;
    int isHybrid_bit = 0;

    // Check CPU vendor
    cpuid(0x00000000, 0x00, &eax, &ebx, &ecx, &edx);
    char vendor[13];
    memcpy(vendor, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';

    if (strcmp(vendor, "GenuineIntel") == 0) {
        // Check if the CPU is hybrid
        cpuid(0x00000007, 0x00, &eax, &ebx, &ecx, &edx);
        isHybrid_bit = (edx >> 15) & 1;

        if (argc == 3 && strcmp(argv[1], "-p") == 0) {
            cpu = atoi(argv[2]);

            // Set CPU affinity to the specified core
            CPU_ZERO(&cpuset);
            CPU_SET(cpu, &cpuset);
            if (sched_setaffinity(0, sizeof(cpu_set_t), &cpuset) != 0) {
                perror("sched_setaffinity");
                return 1;
            }

            printf("CPU %d:\n", cpu);
            printf("Hybrid CPU: %s\n", isHybrid_bit ? "True" : "False");

            if (isHybrid_bit) {
                // Read core type from CPUID leaf 1A, subleaf 0
                cpuid(0x0000001A, 0x00, &eax, &ebx, &ecx, &edx);
                uint32_t core_type = (eax >> 24) & 0xFF;
                printf("Core type: 0x%02x\n", core_type);
                switch (core_type) {
                    case 0x20:
                        printf("Core type: Atom (E-core)\n");
                        break;
                    case 0x40:
                        printf("Core type: Core (P-core)\n");
                        break;
                    default:
                        printf("Core type: Reserved or unknown\n");
                        break;
                }
            } else {
                // Read core type from CPUID leaf 1A, subleaf 0
                cpuid(0x0000001A, 0x00, &eax, &ebx, &ecx, &edx);
                uint32_t core_type = (eax >> 24) & 0xFF;
                printf("Core type: 0x%02x\n", core_type);
                switch (core_type) {
                    case 0x20:
                        printf("Core type: isAtom\n");
                        break;
                    case 0x40:
                        printf("Core type: isCore\n");
                        break;
                    default:
                        printf("Core type: Reserved or unknown\n");
                        break;
                }
            }
        } else {
            printf("Hybrid CPU: %s\n", isHybrid_bit ? "True" : "False");

            if (isHybrid_bit) {
                // Enumerate all cores to determine core types
                for (int i = 0; i < num_cpus; i++) {
                    CPU_ZERO(&cpuset);
                    CPU_SET(i, &cpuset);
                    if (sched_setaffinity(0, sizeof(cpu_set_t), &cpuset) != 0) {
                        perror("sched_setaffinity");
                        return 1;
                    }

                    // Read core type from CPUID leaf 1A, subleaf 0
                    cpuid(0x0000001A, 0x00, &eax, &ebx, &ecx, &edx);
                    uint32_t core_type = (eax >> 24) & 0xFF;
                    if (core_type == 0x20) {
                        if (ecore_start == -1) ecore_start = i;
                        ecore_end = i;
                    } else if (core_type == 0x40) {
                        if (pcore_start == -1) pcore_start = i;
                        pcore_end = i;
                    }
                }

                if (pcore_start != -1 && pcore_end != -1) {
                    printf("P-Core: %d-%d\n", pcore_start, pcore_end);
                }
                if (ecore_start != -1 && ecore_end != -1) {
                    printf("E-Core: %d-%d\n", ecore_start, ecore_end);
                }
            } else {
                // Enumerate all cores to determine core types
                for (int i = 0; i < num_cpus; i++) {
                    CPU_ZERO(&cpuset);
                    CPU_SET(i, &cpuset);
                    if (sched_setaffinity(0, sizeof(cpu_set_t), &cpuset) != 0) {
                        perror("sched_setaffinity");
                        return 1;
                    }

                    // Read core type from CPUID leaf 1A, subleaf 0
                    cpuid(0x0000001A, 0x00, &eax, &ebx, &ecx, &edx);
                    uint32_t core_type = (eax >> 24) & 0xFF;
                    if (core_type == 0x20) {
                        if (ecore_start == -1) ecore_start = i;
                        ecore_end = i;
                    } else if (core_type == 0x40) {
                        if (pcore_start == -1) pcore_start = i;
                        pcore_end = i;
                    }
                }

                if (pcore_start != -1 && pcore_end != -1) {
                    printf("Core: %d-%d\n", pcore_start, pcore_end);
                }
                if (ecore_start != -1 && ecore_end != -1) {
                    printf("Atom: %d-%d\n", ecore_start, ecore_end);
                }
            }
        }
    } else if (strcmp(vendor, "AuthenticAMD") == 0) {
        printf("This is an AMD CPU.\n");
    } else {
        printf("Unknown CPU vendor: %s\n", vendor);
    }

    return 0;
}
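
The P-core range printed by the program above could be used to test whether keeping the benchmark off the E-cores helps. Here is a minimal sketch, assuming the P-cores occupy one contiguous range of logical CPU numbers (the same assumption the start/end output makes); `pin_to_cpu_range` is a hypothetical helper, not something y-cruncher or libgomp provides.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sched.h>

/* Restricts the calling thread to logical CPUs first..last (inclusive).
 * Returns 0 on success, -1 on an invalid range or if the kernel
 * rejects the mask (errno is set by sched_setaffinity in that case). */
int pin_to_cpu_range(int first, int last) {
    if (first < 0 || last < first || last >= CPU_SETSIZE)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = first; i <= last; i++)
        CPU_SET(i, &set);

    /* pid 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof set, &set);
}
```

Threads created after the call inherit the mask, so calling something like `pin_to_cpu_range(0, 15)` early in `main` would confine later worker threads to those CPUs (the numbers are illustrative only). From the shell, `taskset -c 0-15 <program>` has the same effect without code changes.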

AMD has a similar CPUID leaf, 0x80000026, but I haven't tested it on Zen4. It may only be useful for distinguishing ZenX from ZenXC cores.

@Mysticial
Owner

y-cruncher does not touch those options (I didn't even know they existed). However, it does try to lower the process priority to avoid freezing the system.

On Windows it's:

SetPriorityClass(GetCurrentProcess(), BELOW_NORMAL_PRIORITY_CLASS);

On some versions of Windows 10, this causes the program to run exclusively on the E-cores, though I heard that got fixed.

On Linux it's:

struct sched_param param;
param.sched_priority = sched_get_priority_min(SCHED_RR);
pthread_setschedparam(pthread_self(), SCHED_RR, &param);

Not sure if this will also lock the program to E-core on Linux, but it should fail anyway unless you run with sudo.
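
To see whether that call actually takes effect on a given machine, its return value can be checked. This is a minimal sketch built around the same two calls; `try_set_rr_min_priority` is a hypothetical wrapper, not y-cruncher's actual code. Note that `pthread_setschedparam` returns an error number directly rather than setting `errno`, and that on Linux `SCHED_RR` is a real-time policy, so even its minimum priority is scheduled ahead of ordinary `SCHED_OTHER` threads.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Tries to switch the calling thread to SCHED_RR at that policy's
 * minimum priority. Returns 0 on success, or the error number
 * reported by pthread_setschedparam (typically EPERM when the
 * process lacks the privilege to use real-time policies). */
int try_set_rr_min_priority(void) {
    struct sched_param param;
    memset(&param, 0, sizeof param);

    int prio = sched_get_priority_min(SCHED_RR);
    assert(prio != -1); /* SCHED_RR is a valid policy on Linux */
    param.sched_priority = prio;

    int rc = pthread_setschedparam(pthread_self(), SCHED_RR, &param);
    if (rc != 0)
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(rc));
    return rc;
}
```

On older glibc versions this may need to be linked with `-pthread`.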

You can override the behavior by passing priority:0 as one of the early parameters (before the main option argument).

@edisonchan
Author

> You can override the behavior by passing priority:0 as one of the early parameters (before the main option argument).

priority:0 works here; the Pi time dropped by about 12.7%.
