Quick Start
==================

.. note::

    Before reading this guide, please make sure the Ascend environment and Whisper.cpp have been set up by following the :doc:`installation guide <./install>`!

This document helps Ascend developers get started quickly with automatic speech recognition (ASR) using Whisper.cpp on Ascend NPUs.

Downloading a Whisper Model
---------------------------

Whisper is an ASR neural network model trained and open-sourced by OpenAI, and one of the mainstream models in the ASR field today.
To run speech recognition in Whisper.cpp, you need to download a Whisper model and load its weights from a gguf file.
This document describes three ways to obtain the model; choose whichever one suits you.

.. note::

    gguf is a binary file format for storing neural network weights, designed for fast loading and saving of models; see the official ggml documentation for details.

1. Download with the script
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the ``download-ggml-model.sh`` script from the Whisper.cpp project to download a Whisper model that has already been converted to gguf format:

.. code-block:: shell
    :linenos:

    ./download-ggml-model.sh base.en

Here ``base.en`` can be replaced by any name from the list of available Whisper models:

.. code-block:: shell
    :linenos:

    # Whisper models
    models="tiny tiny.en tiny-q5_1 tiny.en-q5_1
            base base.en base-q5_1 base.en-q5_1
            small small.en small.en-tdrz small-q5_1 small.en-q5_1
            medium medium.en medium-q5_0 medium.en-q5_0
            large-v1 large-v2 large-v2-q5_0 large-v3 large-v3-q5_0"

2. Download manually
~~~~~~~~~~~~~~~~~~~~

Whisper models pre-converted to gguf format can be downloaded from:

- https://huggingface.co/ggerganov/whisper.cpp/tree/main
- https://ggml.ggerganov.com

3. Convert a model yourself
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Download one of the models provided by OpenAI, convert it to a gguf model with the commands below, and move the result into the ``./models/`` directory:

.. code-block:: shell
    :linenos:

    python models/convert-pt-to-ggml.py ~/.cache/whisper/medium.pt ~/path/to/repo/whisper/ ./models/whisper-medium
    mv ./models/whisper-medium/ggml-model.bin models/ggml-medium.bin

Preprocessing the Audio File
----------------------------

Use ffmpeg to convert the audio file to be recognized into a 16 kHz, 16-bit mono wav file, here using ``samples/gb0.ogg`` as an example:

.. code-block:: shell
    :linenos:

    ffmpeg -loglevel -0 -y -i samples/gb0.ogg -ar 16000 -ac 1 -c:a pcm_s16le samples/gb0.wav

Automatic Speech Recognition
----------------------------

Run the following command to perform Whisper.cpp automatic speech recognition on the Ascend NPU:

.. code-block:: shell
    :linenos:

    ./build/bin/main -f samples/jfk.wav -m models/ggml-base.en.bin -t 8
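
Beyond this minimal invocation, the ``main`` example accepts further options. As a sketch only, the command below transcribes the ``samples/gb0.wav`` file produced in the preprocessing step and additionally writes an SRT subtitle file; the ``-l`` (language) and ``-osrt`` flags come from the ``main`` example's built-in help and may vary across Whisper.cpp versions, so confirm them with ``./build/bin/main --help`` on your build:

.. code-block:: shell
    :linenos:

    # Sketch: transcribe the preprocessed 16 kHz WAV from the ffmpeg step above.
    # -l en  : assume English audio (the base.en model is English-only)
    # -osrt  : additionally write an SRT subtitle file (samples/gb0.wav.srt)
    ./build/bin/main -f samples/gb0.wav -m models/ggml-base.en.bin -t 8 -l en -osrt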

If the printed transcription matches the content of the audio, recognition succeeded. The following is an example of the expected output for ``samples/jfk.wav``; the line ``whisper_backend_init_gpu: using CANN backend`` indicates that inference is running on the Ascend NPU through the CANN backend:

.. code-block:: shell

    whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
    whisper_init_with_params_no_state: use gpu    = 1
    whisper_init_with_params_no_state: flash attn = 0
    whisper_init_with_params_no_state: gpu_device = 0
    whisper_init_with_params_no_state: dtw        = 0
    whisper_model_load: loading model
    whisper_model_load: n_vocab       = 51864
    whisper_model_load: n_audio_ctx   = 1500
    whisper_model_load: n_audio_state = 512
    whisper_model_load: n_audio_head  = 8
    whisper_model_load: n_audio_layer = 6
    whisper_model_load: n_text_ctx    = 448
    whisper_model_load: n_text_state  = 512
    whisper_model_load: n_text_head   = 8
    whisper_model_load: n_text_layer  = 6
    whisper_model_load: n_mels        = 80
    whisper_model_load: ftype         = 1
    whisper_model_load: qntvr         = 0
    whisper_model_load: type          = 2 (base)
    whisper_model_load: adding 1607 extra tokens
    whisper_model_load: n_langs       = 99
    whisper_model_load: CPU total size =   147.37 MB
    whisper_model_load: model size    =  147.37 MB
    whisper_backend_init_gpu: using CANN backend
    whisper_init_state: kv self size  =   18.87 MB
    whisper_init_state: kv cross size =   18.87 MB
    whisper_init_state: kv pad size   =    3.15 MB
    whisper_init_state: compute buffer (conv)   =   16.75 MB
    whisper_init_state: compute buffer (encode) =  131.94 MB
    whisper_init_state: compute buffer (cross)  =    5.17 MB
    whisper_init_state: compute buffer (decode) =  153.13 MB

    system_info: n_threads = 8 / 192 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 1

    main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

    [00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

    whisper_print_timings:     load time =   223.83 ms
    whisper_print_timings:     fallbacks =   0 p /   0 h
    whisper_print_timings:      mel time =    19.95 ms
    whisper_print_timings:   sample time =    94.43 ms /   131 runs (    0.72 ms per run)
    whisper_print_timings:   encode time =   632.05 ms /     1 runs (  632.05 ms per run)
    whisper_print_timings:   decode time =    56.30 ms /     2 runs (   28.15 ms per run)
    whisper_print_timings:   batchd time =   930.68 ms /   125 runs (    7.45 ms per run)
    whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
    whisper_print_timings:    total time =  2854.32 ms
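
When several recordings need to be processed, the preprocessing and recognition steps above can be chained in a small shell loop. The sketch below simply reuses the ffmpeg and ``main`` commands shown in this document; the ``samples/*.ogg`` file layout is an assumption for illustration:

.. code-block:: shell
    :linenos:

    # Sketch: convert every .ogg sample to a 16 kHz 16-bit mono WAV,
    # then run ASR on the NPU for each converted file.
    for f in samples/*.ogg; do
        wav="${f%.ogg}.wav"
        ffmpeg -loglevel -0 -y -i "$f" -ar 16000 -ac 1 -c:a pcm_s16le "$wav"
        ./build/bin/main -f "$wav" -m models/ggml-base.en.bin -t 8
    done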