[MM-59980] Upgrade to whisper.cpp v1.7.1 #33

Merged
merged 6 commits into from
Oct 29, 2024

Conversation

streamer45
Contributor

Summary

Interestingly, benchmarks revealed that AVX512 wasn't adding much value. In fact, binaries compiled with that instruction set performed worse on average.

Since we are here, we also bump whisper.cpp to the latest version (v1.7.1), which shows a reasonable performance boost (~5% on tiny, ~3.5% on base).

This also means we'll be able to get rid of https://github.com/mattermost/mattermost-plugin-calls/blob/1720eb5ab348bb358869369045b194f8c461a84b/.github/workflows/e2e.yml#L55-L80 soon enough.

Ticket Link

https://mattermost.atlassian.net/browse/MM-59980

streamer45 added the "2: Dev Review" (requires review by a core committer) and "Do Not Merge" (should not be merged until this label is removed) labels Oct 23, 2024
streamer45 added this to the v0.5.0 milestone Oct 23, 2024
streamer45 requested a review from cpoile October 23, 2024 22:12
streamer45 self-assigned this Oct 23, 2024
cpoile previously approved these changes Oct 24, 2024
Member

cpoile left a comment

Nice investigation, thank you!

@agnivade
Member

Very high data widths are very sensitive to data layout and L1/L2 cache invalidations. If the code is not perfectly laid out by the compiler, it can have the opposite effect, with the CPU incurring more cache misses. I'd be really curious if you could run perf on the same tests with AVX512 enabled and disabled, and look at the CPU stats.

# Various basic CPU statistics, system wide, for the specified command:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a command

# Various CPU level 1 data cache statistics for the specified command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command

# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command

# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
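
For anyone repeating the comparison, here is a minimal Python sketch that assembles the event groups listed above into a single `perf stat` invocation. The helper names (`EVENT_GROUPS`, `build_perf_cmd`, `run_perf`) are hypothetical, and the sketch assumes `perf` is installed and permitted to read hardware counters on the machine.

```python
import shlex
import subprocess

# Event groups taken from the perf commands quoted above.
EVENT_GROUPS = {
    "basic": ["cycles", "instructions", "cache-references", "cache-misses", "bus-cycles"],
    "l1d": ["L1-dcache-loads", "L1-dcache-load-misses", "L1-dcache-stores"],
    "dtlb": ["dTLB-loads", "dTLB-load-misses", "dTLB-prefetch-misses"],
    "llc": ["LLC-loads", "LLC-load-misses", "LLC-stores", "LLC-prefetches"],
}


def build_perf_cmd(groups, command):
    """Return an argv list of the form: perf stat -e <events> -- <command...>."""
    events = [event for group in groups for event in EVENT_GROUPS[group]]
    return ["perf", "stat", "-e", ",".join(events), "--", *command]


def run_perf(groups, command):
    """Run the command under perf stat and return the counter report.

    perf stat writes its report to stderr, so that is what gets returned.
    """
    result = subprocess.run(
        build_perf_cmd(groups, command),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stderr


if __name__ == "__main__":
    # Only print the command here; call run_perf() on a machine where perf works.
    cmd = build_perf_cmd(["basic", "l1d"], ["./bench", "-m", "../whisper.cpp/models/ggml-base.bin"])
    print(shlex.join(cmd))
```

Requesting many events at once makes perf multiplex the hardware counters, so each count is scaled from the fraction of time that counter was actually running (the percentage perf prints in parentheses).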

@streamer45
Contributor Author

@agnivade I'll do it as I share your spirit of curiosity, but please don't make me mess with "how the code is laid out by the compiler" on this one :P

Some background discussion at ggerganov/whisper.cpp#2099 has some pointers as well.

@agnivade
Member

but please don't make me mess with "how the code is laid out by the compiler" on this one :P

lol, I'm not a monster :)

@streamer45
Contributor Author

streamer45 commented Oct 24, 2024

You know, none of those events exist on EC2 since it's virtualized, so it's going to be harder to get this data without going bare metal, and then it wouldn't be a fair (or very useful) comparison.

The best thing would be to use what's available:

Metric Groups:

BrMispredicts:
  IpMispredict
       [Number of Instructions per non-speculative Branch Misprediction (JEClear)]
Branches:
  BpTkBranch
       [Branch instructions per taken branch]
  IpBranch
       [Instructions per Branch (lower number means higher occurrence rate)]
  IpCall
       [Instructions per (near) call (lower number means higher occurrence rate)]
  IpFarBranch
       [Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]]
  IpTB
       [Instruction per taken branch]
CacheMisses:
  L1MPKI
       [L1 cache true misses per kilo instruction for retired demand loads]
  L2MPKI
       [L2 cache true misses per kilo instruction for retired demand loads]
  L2MPKI_All
       [L2 cache misses per kilo instruction for all request types (including speculative)]
  L3MPKI
       [L3 cache true misses per kilo instruction for retired demand loads]
DSB:
  DSB_Coverage
       [Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)]
FetchBW:
  DSB_Coverage
       [Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)]
  IpTB
       [Instruction per taken branch]
Flops:
  FLOPc
       [Floating Point Operations Per Cycle]
  GFLOPs
       [Giga Floating Point Operations Per Second]
  IpFLOP
       [Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)]
FpArith:
  IpFLOP
       [Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)]
HPC:
  CPU_Utilization
       [Average CPU Utilization]
  DRAM_BW_Use
       [Average external Memory Bandwidth Use for reads and writes [GB / sec]]
  GFLOPs
       [Giga Floating Point Operations Per Second]
InsType:
  IpBranch
       [Instructions per Branch (lower number means higher occurrence rate)]
  IpFLOP
       [Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)]
  IpLoad
       [Instructions per Load (lower number means higher occurrence rate)]
  IpStore
       [Instructions per Store (lower number means higher occurrence rate)]
IoBW:
  IO_Read_BW
       [Average IO (network or disk) Bandwidth Use for Reads [GB / sec]]
  IO_Write_BW
       [Average IO (network or disk) Bandwidth Use for Writes [GB / sec]]
L2Evicts:
  L2_Evictions_NonSilent_PKI
       [Rate of non silent evictions from the L2 cache per Kilo instruction]
  L2_Evictions_Silent_PKI
       [Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)]
LSD:
  LSD_Coverage
       [Fraction of Uops delivered by the LSD (Loop Stream Detector; aka Loop Cache)]
MemoryBW:
  DRAM_BW_Use
       [Average external Memory Bandwidth Use for reads and writes [GB / sec]]
  L1D_Cache_Fill_BW
       [Average data fill bandwidth to the L1 data cache [GB / sec]]
  L2_Cache_Fill_BW
       [Average data fill bandwidth to the L2 cache [GB / sec]]
  L3_Cache_Access_BW
       [Average per-core data access bandwidth to the L3 cache [GB / sec]]
  L3_Cache_Fill_BW
       [Average per-core data fill bandwidth to the L3 cache [GB / sec]]
  MEM_Parallel_Reads
       [Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches]
  MLP
       [Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)]
  PMM_Read_BW
       [Average 3DXP Memory Bandwidth Use for reads [GB / sec]]
  PMM_Write_BW
       [Average 3DXP Memory Bandwidth Use for Writes [GB / sec]]
MemoryBound:
  Load_Miss_Real_Latency
       [Actual Average Latency for L1 data-cache miss demand loads (in core cycles)]
  MLP
       [Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)]
MemoryLat:
  Load_Miss_Real_Latency
       [Actual Average Latency for L1 data-cache miss demand loads (in core cycles)]
  MEM_PMM_Read_Latency
       [Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches]
  MEM_Read_Latency
       [Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches]
MemoryTLB:
  Page_Walks_Utilization
       [Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses]
OS:
  IpFarBranch
       [Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]]
  Kernel_Utilization
       [Fraction of cycles spent in the Operating System (OS) Kernel mode]
Offcore:
  L2MPKI_All
       [L2 cache misses per kilo instruction for all request types (including speculative)]
  L3_Cache_Access_BW
       [Average per-core data access bandwidth to the L3 cache [GB / sec]]
PGO:
  BpTkBranch
       [Branch instructions per taken branch]
  IpTB
       [Instruction per taken branch]
Pipeline:
  CLKS
       [Per-Logical Processor actual clocks when the Logical Processor is active]
  CPI
       [Cycles Per Instruction (per Logical Processor)]
  ILP
       [Instruction-Level-Parallelism (average number of uops executed when there is at least 1 uop executed)]
  UPI
       [Uops Per Instruction]
PortsUtil:
  ILP
       [Instruction-Level-Parallelism (average number of uops executed when there is at least 1 uop executed)]
Power:
  Average_Frequency
       [Measured Average Frequency for unhalted processors [GHz]]
  C1_Core_Residency
       [C1 residency percent per core]
  C2_Pkg_Residency
       [C2 residency percent per package]
  C6_Core_Residency
       [C6 residency percent per core]
  C6_Pkg_Residency
       [C6 residency percent per package]
  Turbo_Utilization
       [Average Frequency Utilization relative nominal frequency]
Retire:
  UPI
       [Uops Per Instruction]
SMT:
  CORE_CLKS
       [Core actual clocks when any Logical Processor is active on the Physical Core]
  CoreIPC
       [Instructions Per Cycle (per physical core)]
  SMT_2T_Utilization
       [Fraction of cycles where both hardware Logical Processors were active]
Server:
  IO_Read_BW
       [Average IO (network or disk) Bandwidth Use for Reads [GB / sec]]
  IO_Write_BW
       [Average IO (network or disk) Bandwidth Use for Writes [GB / sec]]
  L2_Evictions_NonSilent_PKI
       [Rate of non silent evictions from the L2 cache per Kilo instruction]
  L2_Evictions_Silent_PKI
       [Rate of silent evictions from the L2 cache per Kilo instruction where the evicted lines are dropped (no writeback to L3 or memory)]
  MEM_PMM_Read_Latency
       [Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches]
  PMM_Read_BW
       [Average 3DXP Memory Bandwidth Use for reads [GB / sec]]
  PMM_Write_BW
       [Average 3DXP Memory Bandwidth Use for Writes [GB / sec]]
SoC:
  DRAM_BW_Use
       [Average external Memory Bandwidth Use for reads and writes [GB / sec]]
  IO_Read_BW
       [Average IO (network or disk) Bandwidth Use for Reads [GB / sec]]
  IO_Write_BW
       [Average IO (network or disk) Bandwidth Use for Writes [GB / sec]]
  MEM_PMM_Read_Latency
       [Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches]
  MEM_Parallel_Reads
       [Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches]
  MEM_Read_Latency
       [Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches]
  PMM_Read_BW
       [Average 3DXP Memory Bandwidth Use for reads [GB / sec]]
  PMM_Write_BW
       [Average 3DXP Memory Bandwidth Use for Writes [GB / sec]]
  Socket_CLKS
       [Socket actual clocks when any core is active on that socket]
Summary:
  Average_Frequency
       [Measured Average Frequency for unhalted processors [GHz]]
  CPU_Utilization
       [Average CPU Utilization]
  IPC
       [Instructions Per Cycle (per Logical Processor)]
  Instructions
       [Total number of retired Instructions, Sample with: INST_RETIRED.PREC_DIST]
TmaL1:
  CoreIPC
       [Instructions Per Cycle (per physical core)]
  Instructions
       [Total number of retired Instructions, Sample with: INST_RETIRED.PREC_DIST]

Well, scratch that, none of it works either, so I think we are stuck in terms of benchmarking this on EC2.

streamer45 added the "3: Reviews Complete" label (all reviewers have approved the pull request) and removed the "2: Dev Review" label Oct 24, 2024
@agnivade
Member

Ah no, I was suggesting benchmarking on your laptop. You can use perflock (https://github.com/aclements/perflock) to run benchmarks reliably there. It would also help to shut down your browser and code editor.

@agnivade
Member

Anyways, please feel free to ignore if it's too much work. It was just a curiosity from my side.

@streamer45
Contributor Author

@agnivade Here you go :)

AVX512=0
~/build/whisper.cpp-1.7.1 » sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches ./bench -m ../whisper.cpp/models/ggml-base.bin
whisper_init_from_file_with_params_no_state: loading model from '../whisper.cpp/models/ggml-base.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   96.35 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0

whisper_print_timings:     load time =   108.51 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =  1262.89 ms /     1 runs ( 1262.89 ms per run)
whisper_print_timings:   decode time =   887.87 ms /   256 runs (    3.47 ms per run)
whisper_print_timings:   batchd time =   524.17 ms /   320 runs (    1.64 ms per run)
whisper_print_timings:   prompt time =  5729.83 ms /  4096 runs (    1.40 ms per run)
whisper_print_timings:    total time =  8405.77 ms

If you wish, you can submit these results here:

https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

- CPU model
- Operating system
- Compiler


Performance counter stats for './bench -m ../whisper.cpp/models/ggml-base.bin':

 123,711,448,912      cycles                                                        (23.15%)
 288,429,213,992      instructions              #    2.33  insn per cycle           (30.86%)
   2,002,111,100      cache-references                                              (38.54%)
     791,427,474      cache-misses              #   39.530 % of all cache refs      (46.22%)
   1,559,416,635      bus-cycles                                                    (53.89%)
 120,286,545,420      L1-dcache-loads                                               (61.58%)
   4,986,708,700      L1-dcache-load-misses     #    4.15% of all L1-dcache accesses  (69.28%)
   8,335,183,396      L1-dcache-stores                                              (69.26%)
 120,449,646,259      dTLB-loads                                                    (69.20%)
       4,317,144      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (30.75%)
 <not supported>      dTLB-prefetch-misses                                        
     270,361,766      LLC-loads                                                     (30.74%)
     136,775,835      LLC-load-misses           #   50.59% of all LL-cache accesses  (30.78%)
     183,504,847      LLC-stores                                                    (15.45%)
 <not supported>      LLC-prefetches                                              

    10.326926261 seconds time elapsed

    40.482704000 seconds user
     0.235992000 seconds sys

AVX512=1
~/build/whisper.cpp-1.7.1 » sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches ./bench -m ../whisper.cpp/models/ggml-base.bin
whisper_init_from_file_with_params_no_state: loading model from '../whisper.cpp/models/ggml-base.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   96.35 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0

whisper_print_timings:     load time =   101.64 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =  1275.62 ms /     1 runs ( 1275.62 ms per run)
whisper_print_timings:   decode time =   926.56 ms /   256 runs (    3.62 ms per run)
whisper_print_timings:   batchd time =   580.10 ms /   320 runs (    1.81 ms per run)
whisper_print_timings:   prompt time =  6189.10 ms /  4096 runs (    1.51 ms per run)
whisper_print_timings:    total time =  8972.46 ms

If you wish, you can submit these results here:

https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

- CPU model
- Operating system
- Compiler


Performance counter stats for './bench -m ../whisper.cpp/models/ggml-base.bin':

 117,831,968,199      cycles                                                        (23.14%)
 189,136,870,811      instructions              #    1.61  insn per cycle           (30.84%)
   1,967,539,345      cache-references                                              (38.53%)
     795,283,432      cache-misses              #   40.420 % of all cache refs      (46.22%)
   1,619,560,927      bus-cycles                                                    (53.91%)
  68,150,388,071      L1-dcache-loads                                               (61.61%)
   4,973,845,055      L1-dcache-load-misses     #    7.30% of all L1-dcache accesses  (69.31%)
   6,796,876,636      L1-dcache-stores                                              (69.27%)
  68,504,710,729      dTLB-loads                                                    (69.21%)
       3,433,537      dTLB-load-misses          #    0.01% of all dTLB cache accesses  (30.71%)
 <not supported>      dTLB-prefetch-misses                                        
     286,682,143      LLC-loads                                                     (30.74%)
     168,301,708      LLC-load-misses           #   58.71% of all LL-cache accesses  (30.77%)
     183,825,736      LLC-stores                                                    (15.44%)
 <not supported>      LLC-prefetches                                              

    10.807231840 seconds time elapsed

    41.965993000 seconds user
     0.324169000 seconds sys
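
To put the two runs side by side, here is a minimal Python sketch that parses the `whisper_print_timings` lines and reports the relative change per stage. The function names are hypothetical and the timing lines are copied from the two logs above. On these numbers the AVX512 build's total time comes out roughly 6.7% higher, though this is a single run per build and should be read as indicative only.

```python
import re

# Matches lines such as "whisper_print_timings:   encode time =  1262.89 ms / ...".
TIMING_RE = re.compile(r"whisper_print_timings:\s+(\w+) time =\s+([\d.]+) ms")


def parse_timings(log: str) -> dict[str, float]:
    """Map each timing name (encode, decode, ...) to its duration in milliseconds."""
    return {name: float(ms) for name, ms in TIMING_RE.findall(log)}


def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Timing lines copied from the AVX512=0 run above.
AVX512_OFF = """
whisper_print_timings:   encode time =  1262.89 ms /     1 runs ( 1262.89 ms per run)
whisper_print_timings:   decode time =   887.87 ms /   256 runs (    3.47 ms per run)
whisper_print_timings:   batchd time =   524.17 ms /   320 runs (    1.64 ms per run)
whisper_print_timings:   prompt time =  5729.83 ms /  4096 runs (    1.40 ms per run)
whisper_print_timings:    total time =  8405.77 ms
"""

# Timing lines copied from the AVX512=1 run above.
AVX512_ON = """
whisper_print_timings:   encode time =  1275.62 ms /     1 runs ( 1275.62 ms per run)
whisper_print_timings:   decode time =   926.56 ms /   256 runs (    3.62 ms per run)
whisper_print_timings:   batchd time =   580.10 ms /   320 runs (    1.81 ms per run)
whisper_print_timings:   prompt time =  6189.10 ms /  4096 runs (    1.51 ms per run)
whisper_print_timings:    total time =  8972.46 ms
"""

off = parse_timings(AVX512_OFF)
on = parse_timings(AVX512_ON)

for name in off:
    change = percent_change(off[name], on[name])
    print(f"{name:>7}: {off[name]:9.2f} ms -> {on[name]:9.2f} ms ({change:+.2f}%)")
```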

@streamer45
Contributor Author

Verified the image built is working on an instance without AVX512 support.
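
The comment does not say how the instance was checked. One possible way to confirm that a Linux x86 host lacks AVX512, sketched in Python, is to look for the `avx512f` flag in `/proc/cpuinfo`. The function names are hypothetical, and this only inspects the host CPU, not the binary itself.

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def has_avx512(cpuinfo_text: str) -> bool:
    """AVX512 Foundation (avx512f) is the base feature every AVX512 CPU reports."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AVX512F supported:", has_avx512(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this platform")
```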

streamer45 requested a review from cpoile October 29, 2024 21:24
streamer45 removed the "Do Not Merge" label Oct 29, 2024
streamer45 merged commit 81a9db5 into master Oct 29, 2024
2 checks passed
streamer45 deleted the MM-59980 branch October 29, 2024 21:37
@agnivade
Member

agnivade commented Nov 6, 2024

Sorry, just getting to this now.

So, as expected, the IPC decreases from 2.33 to 1.61. That is theoretically fine, since the AVX512 build retires far fewer instructions (about 189B versus 288B). But we can see where the problem is:

  1. The L1-dcache-load-miss rate increases from 4.15% to 7.30%, most probably because the code is not aligned properly, leading to more cache misses.
  2. The LLC-load-miss rate also increases, from 50.59% to 58.71%. This is a continuation of the same effect. So both the L1 and L3 caches are missed more often, which costs more time.
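
The ratios quoted here can be recomputed from the raw counters. Below is a minimal Python sketch (hypothetical function names, counter lines copied from the two perf reports above with the trailing annotations trimmed). It also prints the absolute change per counter, which is worth a look: on these numbers the absolute L1-dcache-load-misses count is nearly identical in both runs while L1-dcache-loads drops sharply, so part of the higher L1 miss rate comes from the smaller denominator, whereas LLC-load-misses rises in absolute terms too. The counters were multiplexed, so all values are scaled estimates.

```python
import re

# Matches counter lines such as " 123,711,448,912      cycles".
COUNTER_RE = re.compile(r"^[ \t]*([\d,]+)[ \t]+([\w-]+)", re.MULTILINE)


def parse_counters(perf_output: str) -> dict[str, int]:
    """Map each event name to its (comma-free) integer count."""
    return {event: int(value.replace(",", "")) for value, event in COUNTER_RE.findall(perf_output)}


def derived(c: dict[str, int]) -> dict[str, float]:
    """Compute IPC and the L1/LLC load-miss rates (in percent) from raw counts."""
    return {
        "ipc": c["instructions"] / c["cycles"],
        "l1_miss_pct": 100.0 * c["L1-dcache-load-misses"] / c["L1-dcache-loads"],
        "llc_miss_pct": 100.0 * c["LLC-load-misses"] / c["LLC-loads"],
    }


# Counter lines copied from the AVX512=0 report above.
AVX512_OFF = """
 123,711,448,912      cycles
 288,429,213,992      instructions
 120,286,545,420      L1-dcache-loads
   4,986,708,700      L1-dcache-load-misses
     270,361,766      LLC-loads
     136,775,835      LLC-load-misses
"""

# Counter lines copied from the AVX512=1 report above.
AVX512_ON = """
 117,831,968,199      cycles
 189,136,870,811      instructions
  68,150,388,071      L1-dcache-loads
   4,973,845,055      L1-dcache-load-misses
     286,682,143      LLC-loads
     168,301,708      LLC-load-misses
"""

off = parse_counters(AVX512_OFF)
on = parse_counters(AVX512_ON)

for label, counters in (("AVX512=0", off), ("AVX512=1", on)):
    d = derived(counters)
    print(f"{label}: IPC={d['ipc']:.2f}  L1 miss={d['l1_miss_pct']:.2f}%  LLC miss={d['llc_miss_pct']:.2f}%")

# Absolute change per counter, AVX512=1 relative to AVX512=0.
for event in off:
    print(f"{event:>22}: {100.0 * (on[event] - off[event]) / off[event]:+.1f}%")
```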

@streamer45
Contributor Author

Thanks, that makes sense. The perf output was actually color-coded to highlight the increase in cache misses. Good stuff to keep in mind if we ever need to think about this in the future.
