Feature/CPU Detection for Apple M1 #40876

chriselrod · 2021-05-19T13:45:43Z

Originally posted here.

The Apple M1 supports ARMv8.4-A, but Julia/LLVM treats it like an A7/Cyclone CPU:

julia> versioninfo()
Julia Version 1.7.0-DEV.1107
Commit 5aca7a37be* (2021-05-15 16:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.3.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 4

Cyclone is ARMv8-A, although the page on the A14 claims ARMv8.5-A for the Firestorm/Icestorm cores.

As such, atomics are implemented using a load-linked/store-conditional (LL/SC) loop:

julia> a = Threads.Atomic{Int}(1)
Base.Threads.Atomic{Int64}(1)

julia> @code_native Threads.atomic_add!(a, 2)

        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:405 within `atomic_add!'
        mov     x8, x0
L4:
        ldaxr   x0, [x8]
        add     x9, x0, x1
        stlxr   w10, x9, [x8]
        cbnz    w10, L4
        ret
; └

julia> @code_native Threads.atomic_cas!(a, 5, 2)

        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:373 within `atomic_cas!'
        mov     x8, x0
L4:
        ldaxr   x0, [x8]
        cmp     x0, x1
        b.ne    L28
        stlxr   w9, x2, [x8]
        cbnz    w9, L4
        ret
L28:
        clrex
        ret
; └

However, if I start Julia with -C'armv8.4-a':

julia> a = Threads.Atomic{Int}(1)
Base.Threads.Atomic{Int64}(1)

julia> @code_native Threads.atomic_add!(a, 2)

        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:405 within `atomic_add!'
        ldaddal x1, x0, [x0]
        ret
; └

julia> @code_native Threads.atomic_cas!(a, 5, 2)

        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:373 within `atomic_cas!'
        casal   x1, x2, [x0]
        mov     x0, x1
        ret
; └
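
For comparison, here is a minimal C++ sketch of the same two operations. The function names `atomic_add` and `atomic_cas` are made up for illustration and are not part of Julia or LLVM. Assuming a clang-style toolchain, building this for AArch64 with `-march=armv8-a` should give exclusive load/store retry loops like the first listings, while `-march=armv8.4-a` should allow single-instruction forms such as `ldaddal` and `casal`. The exact output depends on compiler and version.

```cpp
#include <atomic>
#include <cstdint>

// Rough analogue of Threads.atomic_add!(a, v): adds v and returns the old value.
std::int64_t atomic_add(std::atomic<std::int64_t>& a, std::int64_t v) {
    return a.fetch_add(v, std::memory_order_seq_cst);
}

// Rough analogue of Threads.atomic_cas!(a, cmp, newval): stores `desired` only if
// the current value equals `expected`, and returns the value that was observed.
std::int64_t atomic_cas(std::atomic<std::int64_t>& a,
                        std::int64_t expected, std::int64_t desired) {
    // On failure, compare_exchange_strong writes the observed value into `expected`;
    // on success, `expected` already equals the old value.
    a.compare_exchange_strong(expected, desired, std::memory_order_seq_cst);
    return expected;
}
```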

Starting Julia without -C flags:

julia> using Octavian

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000087 seconds (2 allocations: 40.578 KiB)

julia> @benchmark matmul!($C0,$A,$B) # threaded matmul uses atomics
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.425 μs (0.00% GC)
  median time:      6.525 μs (0.00% GC)
  mean time:        6.530 μs (0.00% GC)
  maximum time:     14.592 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

With -C'armv8.4-a':

julia> using Octavian

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000100 seconds (2 allocations: 40.578 KiB)

julia> @benchmark matmul!($C0,$A,$B) # threaded matmul uses atomics
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.258 μs (0.00% GC)
  median time:      6.525 μs (0.00% GC)
  mean time:        6.532 μs (0.00% GC)
  maximum time:     13.475 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

I made non-x86 architectures (including the M1) ramp up thread use more slowly, because earlier performance tests suggested the M1 had higher threading overhead. That may have been partly due to atomics and partly due to the lack of a shared L3 cache, and possibly other causes I'm not aware of.

Of course, more than just atomics separates armv8.(4/5)-a from armv8-a.


gbaraldi · 2021-08-18T03:13:10Z

How hard would fixing this be? Would I be able to do it?
Go did it by hardcoding the CPU features: golang/go#42747

I imagine it means adding some hardcoded options here:

julia/src/processor_arm.cpp

Lines 1184 to 1308 in a08a3ff

    
static NOINLINE std::pair<uint32_t,FeatureList<feature_sz>> _get_host_cpu()
{
    FeatureList<feature_sz> features = {};
    // Here we assume that only the lower 32bit are used on aarch64
    // Change the cast here when that's not the case anymore (and when there's features in the
    // high bits that we want to detect).
    features[0] = (uint32_t)jl_getauxval(AT_HWCAP);
    features[1] = (uint32_t)jl_getauxval(AT_HWCAP2);
#ifdef _CPU_AARCH64_
    if (test_nbit(features, 31)) // HWCAP_PACG
        set_bit(features, Feature::pauth, true);
#endif
    auto cpuinfo = get_cpuinfo();
    auto arch = get_elf_arch();
#ifdef _CPU_ARM_
    if (arch.version >= 7) {
        if (arch.klass == 'M') {
            set_bit(features, Feature::mclass, true);
        }
        else if (arch.klass == 'R') {
            set_bit(features, Feature::rclass, true);
        }
        else if (arch.klass == 'A') {
            set_bit(features, Feature::aclass, true);
        }
    }
    switch (arch.version) {
    case 8:
        set_bit(features, Feature::v8, true);
        JL_FALLTHROUGH;
    case 7:
        set_bit(features, Feature::v7, true);
        break;
    default:
        break;
    }
#endif
    std::set<uint32_t> cpus;
    std::vector<std::pair<uint32_t,CPUID>> list;
    // Ideally the feature detection above should be enough.
    // However depending on the kernel version not all features are available
    // and it's also impossible to detect the ISA version which contains
    // some features not yet exposed by the kernel.
    // We therefore try to get a more complete feature list from the CPU name.
    // Since it is possible to pair cores that have different feature set
    // (Observed for exynos 9810 with exynos-m3 + cortex-a55) we'll compute
    // an intersection of the known features from each core.
    // If there's a core that we don't recognize, treat it as generic.
    bool extra_initialized = false;
    FeatureList<feature_sz> extra_features = {};
    for (auto info: cpuinfo) {
        auto name = (uint32_t)get_cpu_name(info);
        if (name == 0) {
            // no need to clear the feature set if it wasn't initialized
            if (extra_initialized)
                extra_features = FeatureList<feature_sz>{};
            extra_initialized = true;
            continue;
        }
        if (!check_cpu_arch_ver(name, arch))
            continue;
        if (cpus.insert(name).second) {
            if (extra_initialized) {
                extra_features = extra_features & find_cpu(name)->features;
            }
            else {
                extra_initialized = true;
                extra_features = find_cpu(name)->features;
            }
            list.emplace_back(name, info);
        }
    }
    features = features | extra_features;
    // Not all elements/pairs are valid
    static constexpr CPU v8order[] = {
        CPU::arm_cortex_a35,
        CPU::arm_cortex_a53,
        CPU::arm_cortex_a55,
        CPU::arm_cortex_a57,
        CPU::arm_cortex_a72,
        CPU::arm_cortex_a73,
        CPU::arm_cortex_a75,
        CPU::arm_cortex_a76,
        CPU::arm_neoverse_n1,
        CPU::arm_neoverse_n2,
        CPU::arm_neoverse_v1,
        CPU::nvidia_denver2,
        CPU::nvidia_carmel,
        CPU::samsung_exynos_m1,
        CPU::samsung_exynos_m2,
        CPU::samsung_exynos_m3,
        CPU::samsung_exynos_m4,
        CPU::samsung_exynos_m5,
    };
    shrink_big_little(list, v8order, sizeof(v8order) / sizeof(CPU));
#ifdef _CPU_ARM_
    // Not all elements/pairs are valid
    static constexpr CPU v7order[] = {
        CPU::arm_cortex_a5,
        CPU::arm_cortex_a7,
        CPU::arm_cortex_a8,
        CPU::arm_cortex_a9,
        CPU::arm_cortex_a12,
        CPU::arm_cortex_a15,
        CPU::arm_cortex_a17
    };
    shrink_big_little(list, v7order, sizeof(v7order) / sizeof(CPU));
#endif
    uint32_t cpu = 0;
    if (list.empty()) {
        cpu = (uint32_t)generic_for_arch(arch);
    }
    else {
        // This also covers `list.size() > 1` case which means there's a unknown combination
        // consists of CPU's we know. Unclear what else we could try so just randomly return
        // one...
        cpu = list[0].first;
    }
    // Ignore feature bits that we are not interested in.
    mask_features(feature_masks, &features[0]);
    return std::make_pair(cpu, features);
}
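
The per-core intersection described in the comments of that function can be sketched in isolation. This is a simplified illustration only: `Core`, `Features`, and `common_features` are hypothetical names, `std::bitset` stands in for Julia's `FeatureList`, and the de-duplication and architecture-version checks of the real code are left out.

```cpp
#include <bitset>
#include <vector>

using Features = std::bitset<64>;  // stand-in for FeatureList<feature_sz>

struct Core {
    bool known;         // false when the core name is not recognized
    Features features;  // name-derived features, meaningful only if known
};

// Intersect the name-derived feature sets of all cores. An unrecognized core
// is treated as generic, which empties the result for every later core too.
Features common_features(const std::vector<Core>& cores) {
    bool initialized = false;
    Features common;
    for (const Core& c : cores) {
        if (!c.known) {
            common.reset();
            initialized = true;
            continue;
        }
        if (initialized) {
            common &= c.features;
        } else {
            common = c.features;
            initialized = true;
        }
    }
    return common;
}
```

A hardcoded entry for the M1 would then simply be one more known core whose feature set feeds into this intersection.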

It might be possible to do this programmatically using sysctlbyname (developer.apple.com/documentation/kernel/1387446-sysctlbyname), but that would require refactoring the code, since I think it currently assumes Linux.
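
As a rough sketch of what such a query could look like: the helper `query_hw_optional` below is hypothetical and not part of Julia. It wraps `sysctlbyname` on macOS and reports failure everywhere else. The key name used in the usage note, `hw.optional.armv8_1_atomics`, is an assumption about what the kernel exposes and should be checked against `sysctl hw.optional` on the target machine.

```cpp
#include <cstddef>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Hypothetical helper: reads an integer-valued sysctl key and stores whether it
// is nonzero in `present`. Returns false (leaving `present` untouched) if the
// key cannot be read or the platform is not macOS.
bool query_hw_optional(const char* key, bool& present) {
#if defined(__APPLE__)
    int value = 0;
    std::size_t len = sizeof(value);
    if (sysctlbyname(key, &value, &len, nullptr, 0) != 0)
        return false;
    present = (value != 0);
    return true;
#else
    (void)key;
    (void)present;
    return false;
#endif
}
```

A caller might write `bool lse = false; if (query_hw_optional("hw.optional.armv8_1_atomics", lse) && lse) { /* enable the LSE feature bit */ }`, falling back to a hardcoded feature list when the query fails.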

giordano · 2022-04-06T21:07:02Z

@chriselrod I presume this was fixed by #41924?

julia> versioninfo()
Julia Version 1.9.0-DEV.332
Commit 559244b383* (2022-04-06 16:01 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.4.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

julia> @code_native Threads.atomic_add!(Threads.Atomic{Int}(1), 2)

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	"_julia_atomic_add!_11581"      ; -- Begin function julia_atomic_add!_11581
	.p2align	2
"_julia_atomic_add!_11581":             ; @"julia_atomic_add!_11581"
; ┌ @ atomics.jl:405 within `atomic_add!`
	.cfi_startproc
; %bb.0:                                ; %top
	ldaddal	x1, x0, [x0]
	ret
	.cfi_endproc
; └
                                        ; -- End function
.subsections_via_symbols

ViralBShah added the system:apple silicon Affects Apple Silicon only (Darwin/ARM64) - e.g. M1 and other M-series chips label May 19, 2021

yuyichao mentioned this issue Aug 16, 2021

Cpu feature detection issues M1 mac #41895

Closed

gbaraldi mentioned this issue Aug 18, 2021

Add feature detection for ARM/MacOS #41924

Merged

dnadlinger mentioned this issue Jan 5, 2022

Darwin/ARM64: Julia freezes on nested @threads loops #41820

Closed

oscardssmith closed this as completed Apr 6, 2022
