[MM-59980] Upgrade to whisper.cpp v1.7.1 #33

streamer45 · 2024-10-23T22:12:47Z

Summary

Interestingly, benchmarks revealed that AVX512 wasn't adding much value. In fact, binaries compiled with that instruction set performed worse on average.

Since we are here, we also bump whisper.cpp to the latest version (v1.7.1), which shows a reasonable performance boost (~5% on tiny, ~3.5% on base).

This also means we'll be able to get rid of https://github.com/mattermost/mattermost-plugin-calls/blob/1720eb5ab348bb358869369045b194f8c461a84b/.github/workflows/e2e.yml#L55-L80 soon enough.

Ticket Link

https://mattermost.atlassian.net/browse/MM-59980

cpoile

Nice investigation, thank you!

agnivade · 2024-10-24T04:30:09Z

Very wide data widths are very sensitive to data layout and L1/L2 cache invalidations. If the code is not perfectly laid out by the compiler, wider vectors can have the opposite effect and cause more cache misses. I'd be really curious if you could run perf on the same tests with AVX512 enabled and disabled, and look at the CPU stats.

# Various basic CPU statistics, system wide, for the specified command:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a command

# Various CPU level 1 data cache statistics for the specified command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command

# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command

# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
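
The commands above can also be driven from a small script when the same events need to be collected for several builds. This is a minimal sketch, not part of the PR: the helper names `build_perf_stat_argv` and `run_perf_stat` and the `model.bin` path are hypothetical, and it assumes `perf` is installed and that the listed events are available on the host.

```python
import shlex
import subprocess

# A subset of the event names from the commands above. Availability varies by
# CPU, kernel, and virtualization, so some may be reported as <not supported>.
EVENTS = (
    "cycles", "instructions", "cache-references", "cache-misses",
    "L1-dcache-loads", "L1-dcache-load-misses",
    "LLC-loads", "LLC-load-misses",
)


def build_perf_stat_argv(command, events=EVENTS):
    """Return the argv for `perf stat -e <events> -- <command...>`."""
    return ["perf", "stat", "-e", ",".join(events), "--", *command]


def run_perf_stat(command, events=EVENTS):
    """Run the command under perf stat and return perf's counter report.

    perf stat writes its report to stderr by default, so stderr is returned.
    """
    result = subprocess.run(
        build_perf_stat_argv(command, events),
        capture_output=True, text=True, check=True,
    )
    return result.stderr


if __name__ == "__main__":
    # Hypothetical benchmark invocation; substitute your own binary and model.
    print(shlex.join(build_perf_stat_argv(["./bench", "-m", "model.bin"])))
```

Calling `run_perf_stat` once per binary (for example one built with AVX512 and one without) yields two reports that can be compared side by side.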

streamer45 · 2024-10-24T13:57:37Z

@agnivade I'll do it as I share your spirit of curiosity, but please don't make me mess with "how the code is laid out by the compiler" on this one :P

Some background discussion at ggerganov/whisper.cpp#2099 has some pointers as well.

agnivade · 2024-10-24T15:47:10Z

> but please don't make me mess with "how the code is laid out by the compiler" on this one :P

lol, I'm not a monster :)

streamer45 · 2024-10-24T18:31:38Z

You know, none of those events exist on EC2 since it's virtualized, so it's going to be harder to get this data without going bare metal, but then it wouldn't be a fair (or very useful) comparison.

~~The best thing would be to use what's available:~~

Metric Groups:

BrMispredicts:
  IpMispredict
       [Number of Instructions per non-speculative Branch Misprediction (JEClear)]
Branches:
  BpTkBranch
       [Branch instructions per taken branch]
  IpBranch
       [Instructions per Branch (lower number means higher occurrence rate)]
  IpCall
       [Instructions per (near) call (lower number means higher occurrence rate)]
  IpFarBranch
       [Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]]
  IpTB
       [Instruction per taken branch]
CacheMisses:
  L1MPKI
       [L1 cache true misses per kilo instruction for retired demand loads]
  L2MPKI
       [L2 cache true misses per kilo instruction for retired demand loads]
  L2MPKI_All
       [L2 cache misses per kilo instruction for all request types (including speculative)]
  L3MPKI
       [L3 cache true misses per kilo instruction for retired demand loads]
DSB:
  DSB_Coverage
       [Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)]
FetchBW:
  DSB_Coverage
       [Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)]
  IpTB
       [Instruction per taken branch]
Flops:
  FLOPc
       [Floating Point Operations Per Cycle]
  GFLOPs
       [Giga Floating Point Operations Per Second]
  IpFLOP
       [Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)]
FpArith:
  IpFLOP
       [Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)]
HPC:
  CPU_Utilization
       [Average CPU Utilization]
  DRAM_BW_Use
       [Average external Memory Bandwidth Use for reads and writes [GB / sec]]
  GFLOPs
       [Giga Floating Point Operations Per Second]
InsType:
  IpBranch
       [Instructions per Branch (lower number means higher occurrence rate)]
  IpFLOP
       [Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)]
  IpLoad
       [Instructions per Load (lower number means higher occurrence rate)]
  IpStore
       [Instructions per Store (lower number means higher occurrence rate)]
IoBW:
  IO_Read_BW
       [Average IO (network or disk) Bandwidth Use for Reads [GB / sec]]
  IO_Write_BW
       [Average IO (network or disk) Bandwidth Use for Writes [GB / sec]]
L2Evicts:
  L2_Evictions_NonSilent_PKI
       [Rate of non silent evictions from the L2 cache per Kilo instruction]
  L2_Evictions_Silent_PKI
       [Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)]
LSD:
  LSD_Coverage
       [Fraction of Uops delivered by the LSD (Loop Stream Detector; aka Loop Cache)]
MemoryBW:
  DRAM_BW_Use
       [Average external Memory Bandwidth Use for reads and writes [GB / sec]]
  L1D_Cache_Fill_BW
       [Average data fill bandwidth to the L1 data cache [GB / sec]]
  L2_Cache_Fill_BW
       [Average data fill bandwidth to the L2 cache [GB / sec]]
  L3_Cache_Access_BW
       [Average per-core data access bandwidth to the L3 cache [GB / sec]]
  L3_Cache_Fill_BW
       [Average per-core data fill bandwidth to the L3 cache [GB / sec]]
  MEM_Parallel_Reads
       [Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches]
  MLP
       [Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)]
  PMM_Read_BW
       [Average 3DXP Memory Bandwidth Use for reads [GB / sec]]
  PMM_Write_BW
       [Average 3DXP Memory Bandwidth Use for Writes [GB / sec]]
MemoryBound:
  Load_Miss_Real_Latency
       [Actual Average Latency for L1 data-cache miss demand loads (in core cycles)]
  MLP
       [Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)]
MemoryLat:
  Load_Miss_Real_Latency
       [Actual Average Latency for L1 data-cache miss demand loads (in core cycles)]
  MEM_PMM_Read_Latency
       [Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches]
  MEM_Read_Latency
       [Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches]
MemoryTLB:
  Page_Walks_Utilization
       [Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses]
OS:
  IpFarBranch
       [Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]]
  Kernel_Utilization
       [Fraction of cycles spent in the Operating System (OS) Kernel mode]
Offcore:
  L2MPKI_All
       [L2 cache misses per kilo instruction for all request types (including speculative)]
  L3_Cache_Access_BW
       [Average per-core data access bandwidth to the L3 cache [GB / sec]]
PGO:
  BpTkBranch
       [Branch instructions per taken branch]
  IpTB
       [Instruction per taken branch]
Pipeline:
  CLKS
       [Per-Logical Processor actual clocks when the Logical Processor is active]
  CPI
       [Cycles Per Instruction (per Logical Processor)]
  ILP
       [Instruction-Level-Parallelism (average number of uops executed when there is at least 1 uop executed)]
  UPI
       [Uops Per Instruction]
PortsUtil:
  ILP
       [Instruction-Level-Parallelism (average number of uops executed when there is at least 1 uop executed)]
Power:
  Average_Frequency
       [Measured Average Frequency for unhalted processors [GHz]]
  C1_Core_Residency
       [C1 residency percent per core]
  C2_Pkg_Residency
       [C2 residency percent per package]
  C6_Core_Residency
       [C6 residency percent per core]
  C6_Pkg_Residency
       [C6 residency percent per package]
  Turbo_Utilization
       [Average Frequency Utilization relative nominal frequency]
Retire:
  UPI
       [Uops Per Instruction]
SMT:
  CORE_CLKS
       [Core actual clocks when any Logical Processor is active on the Physical Core]
  CoreIPC
       [Instructions Per Cycle (per physical core)]
  SMT_2T_Utilization
       [Fraction of cycles where both hardware Logical Processors were active]
Server:
  IO_Read_BW
       [Average IO (network or disk) Bandwidth Use for Reads [GB / sec]]
  IO_Write_BW
       [Average IO (network or disk) Bandwidth Use for Writes [GB / sec]]
  L2_Evictions_NonSilent_PKI
       [Rate of non silent evictions from the L2 cache per Kilo instruction]
  L2_Evictions_Silent_PKI
       [Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)]
  MEM_PMM_Read_Latency
       [Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches]
  PMM_Read_BW
       [Average 3DXP Memory Bandwidth Use for reads [GB / sec]]
  PMM_Write_BW
       [Average 3DXP Memory Bandwidth Use for Writes [GB / sec]]
SoC:
  DRAM_BW_Use
       [Average external Memory Bandwidth Use for reads and writes [GB / sec]]
  IO_Read_BW
       [Average IO (network or disk) Bandwidth Use for Reads [GB / sec]]
  IO_Write_BW
       [Average IO (network or disk) Bandwidth Use for Writes [GB / sec]]
  MEM_PMM_Read_Latency
       [Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches]
  MEM_Parallel_Reads
       [Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches]
  MEM_Read_Latency
       [Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches]
  PMM_Read_BW
       [Average 3DXP Memory Bandwidth Use for reads [GB / sec]]
  PMM_Write_BW
       [Average 3DXP Memory Bandwidth Use for Writes [GB / sec]]
  Socket_CLKS
       [Socket actual clocks when any core is active on that socket]
Summary:
  Average_Frequency
       [Measured Average Frequency for unhalted processors [GHz]]
  CPU_Utilization
       [Average CPU Utilization]
  IPC
       [Instructions Per Cycle (per Logical Processor)]
  Instructions
       [Total number of retired Instructions, Sample with: INST_RETIRED.PREC_DIST]
TmaL1:
  CoreIPC
       [Instructions Per Cycle (per physical core)]
  Instructions
       [Total number of retired Instructions, Sample with: INST_RETIRED.PREC_DIST]

Well, scratch that, none of it works either so I think we are stuck in terms of benchmarking this on EC2.

agnivade · 2024-10-25T03:44:25Z

Ah no, I was suggesting benchmarking on your laptop. You can use perflock (https://github.com/aclements/perflock) to run benchmarks reliably on your laptop. It would also help to shut down your browser and code editor.

agnivade · 2024-10-25T03:44:50Z

Anyways, please feel free to ignore if it's too much work. It was just a curiosity from my side.

streamer45 · 2024-10-25T20:33:00Z

@agnivade Here you go :)

AVX512=0

~/build/whisper.cpp-1.7.1 » sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches ./bench -m ../whisper.cpp/models/ggml-base.bin
whisper_init_from_file_with_params_no_state: loading model from '../whisper.cpp/models/ggml-base.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   96.35 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0

whisper_print_timings:     load time =   108.51 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =  1262.89 ms /     1 runs ( 1262.89 ms per run)
whisper_print_timings:   decode time =   887.87 ms /   256 runs (    3.47 ms per run)
whisper_print_timings:   batchd time =   524.17 ms /   320 runs (    1.64 ms per run)
whisper_print_timings:   prompt time =  5729.83 ms /  4096 runs (    1.40 ms per run)
whisper_print_timings:    total time =  8405.77 ms

If you wish, you can submit these results here:

https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

- CPU model
- Operating system
- Compiler


Performance counter stats for './bench -m ../whisper.cpp/models/ggml-base.bin':

 123,711,448,912      cycles                                                        (23.15%)
 288,429,213,992      instructions              #    2.33  insn per cycle           (30.86%)
   2,002,111,100      cache-references                                              (38.54%)
     791,427,474      cache-misses              #   39.530 % of all cache refs      (46.22%)
   1,559,416,635      bus-cycles                                                    (53.89%)
 120,286,545,420      L1-dcache-loads                                               (61.58%)
   4,986,708,700      L1-dcache-load-misses     #    4.15% of all L1-dcache accesses  (69.28%)
   8,335,183,396      L1-dcache-stores                                              (69.26%)
 120,449,646,259      dTLB-loads                                                    (69.20%)
       4,317,144      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (30.75%)
 <not supported>      dTLB-prefetch-misses                                        
     270,361,766      LLC-loads                                                     (30.74%)
     136,775,835      LLC-load-misses           #   50.59% of all LL-cache accesses  (30.78%)
     183,504,847      LLC-stores                                                    (15.45%)
 <not supported>      LLC-prefetches                                              

    10.326926261 seconds time elapsed

    40.482704000 seconds user
     0.235992000 seconds sys

AVX512=1

~/build/whisper.cpp-1.7.1 » sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches ./bench -m ../whisper.cpp/models/ggml-base.bin
whisper_init_from_file_with_params_no_state: loading model from '../whisper.cpp/models/ggml-base.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   96.35 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0

whisper_print_timings:     load time =   101.64 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =  1275.62 ms /     1 runs ( 1275.62 ms per run)
whisper_print_timings:   decode time =   926.56 ms /   256 runs (    3.62 ms per run)
whisper_print_timings:   batchd time =   580.10 ms /   320 runs (    1.81 ms per run)
whisper_print_timings:   prompt time =  6189.10 ms /  4096 runs (    1.51 ms per run)
whisper_print_timings:    total time =  8972.46 ms

If you wish, you can submit these results here:

https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

- CPU model
- Operating system
- Compiler


Performance counter stats for './bench -m ../whisper.cpp/models/ggml-base.bin':

 117,831,968,199      cycles                                                        (23.14%)
 189,136,870,811      instructions              #    1.61  insn per cycle           (30.84%)
   1,967,539,345      cache-references                                              (38.53%)
     795,283,432      cache-misses              #   40.420 % of all cache refs      (46.22%)
   1,619,560,927      bus-cycles                                                    (53.91%)
  68,150,388,071      L1-dcache-loads                                               (61.61%)
   4,973,845,055      L1-dcache-load-misses     #    7.30% of all L1-dcache accesses  (69.31%)
   6,796,876,636      L1-dcache-stores                                              (69.27%)
  68,504,710,729      dTLB-loads                                                    (69.21%)
       3,433,537      dTLB-load-misses          #    0.01% of all dTLB cache accesses  (30.71%)
 <not supported>      dTLB-prefetch-misses                                        
     286,682,143      LLC-loads                                                     (30.74%)
     168,301,708      LLC-load-misses           #   58.71% of all LL-cache accesses  (30.77%)
     183,825,736      LLC-stores                                                    (15.44%)
 <not supported>      LLC-prefetches                                              

    10.807231840 seconds time elapsed

    41.965993000 seconds user
     0.324169000 seconds sys
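
To compare the two reports above without reading the columns by eye, the counter lines can be parsed into a dictionary. The following is a minimal sketch that assumes the default human-readable `perf stat` layout shown above; the function name `parse_perf_stat` is hypothetical.

```python
import re

# Matches counter lines such as:
#   "  123,711,448,912      cycles        (23.15%)"
# i.e. an integer with thousands separators followed by an event name.
COUNTER_LINE = re.compile(r"^\s*([\d,]+)\s+([A-Za-z][\w.-]*)")


def parse_perf_stat(report):
    """Map event name -> integer count for each counter line in the report.

    Lines marked "<not supported>", the elapsed/user/sys timing lines, and
    any program output are skipped because they do not start with a plain
    integer followed by whitespace and an event name.
    """
    counters = {}
    for line in report.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1).replace(",", ""))
    return counters


if __name__ == "__main__":
    # A few lines copied from the AVX512=0 report above.
    sample = """
 123,711,448,912      cycles                                    (23.15%)
 288,429,213,992      instructions   #  2.33  insn per cycle    (30.86%)
 <not supported>      dTLB-prefetch-misses
    10.326926261 seconds time elapsed
"""
    for event, count in parse_perf_stat(sample).items():
        print(f"{event}: {count}")
```

The percentages in parentheses at the end of each counter line indicate that the events were multiplexed, so the counts are perf's scaled estimates rather than exact totals.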

streamer45 · 2024-10-29T21:24:41Z

Verified that the built image works on an instance without AVX512 support.
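
One way to confirm that a Linux host lacks AVX512 before running such a check is to look for the `avx512f` flag in `/proc/cpuinfo`. This is a minimal sketch under that assumption, not necessarily how the verification above was done; the helper names are hypothetical.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx512(cpuinfo_text):
    """True if the AVX-512 Foundation flag (avx512f) is advertised."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only
    if path.exists():
        print("AVX512 supported:", has_avx512(path.read_text()))
    else:
        print("/proc/cpuinfo is not available on this platform")
```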

agnivade · 2024-11-06T04:21:07Z

Sorry, just getting to this now.

So, as expected, the IPC decreases from 2.33 to 1.61, which is theoretically a good thing. But we can see where the problem is:

The L1-dcache-load-misses rate increases from 4.15% to 7.30%. This is most probably because the code is not aligned properly, leading to more cache misses.
The LLC-load-misses rate also increases from 50.59% to 58.71%. This is just a continuation of the same effect. So both the L1 and L3 caches are being missed, which costs more time.
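
The ratios quoted above can be reproduced from the raw counters in the two perf reports. The sketch below copies those counter values and recomputes IPC and the miss rates; it also prints the absolute miss counts, since the denominators (total loads) differ between the two builds. The helper names `derived` and `pct_change` are hypothetical, and the counters are perf's scaled estimates because the events were multiplexed.

```python
# Counter values copied from the two perf stat reports earlier in the thread.
RUNS = {
    "AVX512=0": {
        "cycles": 123_711_448_912,
        "instructions": 288_429_213_992,
        "L1-dcache-loads": 120_286_545_420,
        "L1-dcache-load-misses": 4_986_708_700,
        "LLC-loads": 270_361_766,
        "LLC-load-misses": 136_775_835,
    },
    "AVX512=1": {
        "cycles": 117_831_968_199,
        "instructions": 189_136_870_811,
        "L1-dcache-loads": 68_150_388_071,
        "L1-dcache-load-misses": 4_973_845_055,
        "LLC-loads": 286_682_143,
        "LLC-load-misses": 168_301_708,
    },
}


def derived(c):
    """Compute IPC and miss rates (in percent) from raw counters."""
    return {
        "ipc": c["instructions"] / c["cycles"],
        "l1_miss_pct": 100 * c["L1-dcache-load-misses"] / c["L1-dcache-loads"],
        "llc_miss_pct": 100 * c["LLC-load-misses"] / c["LLC-loads"],
    }


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return 100 * (after - before) / before


for name, counters in RUNS.items():
    d = derived(counters)
    print(f"{name}: IPC={d['ipc']:.2f} "
          f"L1 miss rate={d['l1_miss_pct']:.2f}% "
          f"LLC miss rate={d['llc_miss_pct']:.2f}%")

base, avx = RUNS["AVX512=0"], RUNS["AVX512=1"]
for event in ("instructions", "L1-dcache-load-misses", "LLC-load-misses"):
    print(f"{event}: {pct_change(base[event], avx[event]):+.2f}% with AVX512")
```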

streamer45 · 2024-11-06T13:56:22Z

Thanks, that makes sense. The perf output was actually color-coded to highlight the increase in cache misses. Good stuff to keep in mind if we ever need to think about this in the future.

streamer45 added 2 commits October 23, 2024 16:05

Avoid AV512 CPU extensions

fc292b6

Upgrade to whisper.cpp v1.7.1

1d92eb4

streamer45 added the "2: Dev Review" and "Do Not Merge" labels Oct 23, 2024

streamer45 added this to the v0.5.0 milestone Oct 23, 2024

streamer45 requested a review from cpoile October 23, 2024 22:12

streamer45 self-assigned this Oct 23, 2024

Go is not really needed in CI

5900069

streamer45 force-pushed the MM-59980 branch from f8b17ec to 5900069 Compare October 23, 2024 22:13

cpoile previously approved these changes Oct 24, 2024

streamer45 added the "3: Reviews Complete" label and removed the "2: Dev Review" label Oct 24, 2024

Fix condition

fcb209e

streamer45 dismissed cpoile’s stale review via fcb209e October 24, 2024 22:53

streamer45 added 2 commits October 29, 2024 15:19

Update opus

8585973

Fix expression

34937cd

streamer45 requested a review from cpoile October 29, 2024 21:24

streamer45 removed the "Do Not Merge" label Oct 29, 2024

cpoile approved these changes Oct 29, 2024

streamer45 merged commit 81a9db5 into master Oct 29, 2024
2 checks passed

streamer45 deleted the MM-59980 branch October 29, 2024 21:37
