「简体中文」|「English」
SenseVoice是具有音频理解能力的音频基础模型, 包括语音识别(ASR)、语种识别(LID)、语音情感识别(SER)和声学事件分类(AEC)或声学事件检测(AED)。 当前SenseVoice-small支持中、粤、英、日、韩语的多语言语音识别,情感识别和事件检测能力,具有极低的推理延迟。
本项目基于ggml推理框架。
- 基于ggml,不依赖其他第三方库, 致力于端侧部署
- 特征提取参考kaldi-native-fbank库,支持多线程特征提取。
- 支持flash attention解码
- 支持Q3, Q4, Q5, Q6, Q8量化
- 支持更多backend , 理论上来说,ggml支持以下后端,后续将会慢慢适配,欢迎贡献。
后端 | 平台 | 是否支持 |
---|---|---|
CPU | All | ✅ |
Metal | Apple Silicon | ✅ |
BLAS | All | ✅ |
CUDA | Nvidia GPU | ✅ |
Vulkan | GPU | ✅ |
Cann | Ascend NPU | 未测试 |
BLIS | All | |
SYCL | Intel and Nvidia GPU | |
MUSA | Moore Threads GPU | |
hipBLAS | AMD GPU |
- 提升性能
- 修bug
可以直接从下面链接下载模型
git lfs install
git clone https://huggingface.co/lovemefan/sense-voice-gguf.git
# 或从modelscope下载
git clone https://www.modelscope.cn/models/lovemefan/SenseVoiceGGUF.git
或许自行下载官方模型转换
# 下载官方模型
git lfs install
git clone https://www.modelscope.cn/iic/SenseVoiceSmall.git
# 转换模型
python scripts/convert-pt-to-gguf.py \
--model SenseVoiceSmall \
--output /path/to/export/gguf-fp32-sense-voice-small.bin \
--out_type f32
git clone https://github.com/lovemefan/SenseVoice.cpp
cd SenseVoice.cpp
git submodule sync && git submodule update --init --recursive
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 8
# -t means thread num, -t 指定线程数
./bin/sense-voice-main -m /path/gguf-fp16-sense-voice-small.bin /path/asr_example_zh.wav -t 4 -ng
当前使用sense-voice-f16模型输出
$./bin/sense-voice-main -m /data/code/SenseVoice.cpp/scripts/resources/gguf-fp16-sense-voice.bin /data/code/SenseVoice.cpp/scripts/resources/SenseVoiceSmall/example/asr_example_zh.wav -t 4
sense_voice_small_init_from_file_with_params_no_state: loading model from '/data/code/SenseVoice.cpp/scripts/resources/gguf-fp16-sense-voice-small.bin'
sense_voice_model_load: version: 3
sense_voice_model_load: alignment: 32
sense_voice_model_load: data offset: 444480
sense_voice_model_load: loading model
sense_voice_model_load: n_vocab = 25055
sense_voice_model_load: n_encoder_hidden_state = 512
sense_voice_model_load: n_encoder_linear_units = 2048
sense_voice_model_load: n_encoder_attention_heads = 4
sense_voice_model_load: n_encoder_layers = 50
sense_voice_model_load: n_mels = 80
sense_voice_model_load: ftype = 1
sense_voice_model_load: vocab[25055] loaded
sense_voice_model_load: CPU total size = 468.98 MB
sense_voice_model_load: n_tensors: 1197
sense_voice_model_load: load SenseVoiceSmall takes 0.213000 second
sense_voice_init_state: compute buffer (encoder) = 50.40 MB
sense_voice_init_state: compute buffer (decoder) = 13.72 MB
system_info: n_threads = 4 / 256 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0
main: processing audio (88747 samples, 5.54669 sec) , 4 threads, 1 processors, lang = auto...
sense_voice_pcm_to_feature_with_state: calculate fbank and cmvn takes 7.207 ms
<|zh|><|NEUTRAL|><|Speech|><|withitn|>欢迎大家来体验达摩院推出的语音识别模型。
sense_voice_full_with_state: decoder audio use 1.011289 s, rtf is 0.182323.
- 本项目借用并模仿来自whisper.cpp 的大部分c++代码
- 参考来自funasr的paraformer模型结构以及前向计算 FunASR
- 本项目参考并借用 kaldi-native-fbank中的fbank特征提取算法。 FunASR 中的lrf + cmvn 算法
- 借用了大量的前期工作paraformer.cpp, paraformer.cpp项目后续将继续更新