Whisper Speech-to-Text

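While working on an audio transcription project recently, I tried several commercial speech recognition services. They were either too expensive because they bill by audio duration, or not accurate enough. Then I found Whisper, the model OpenAI open-sourced. Running it locally gave impressive results: in my tests, accuracy on mixed Chinese and English speech exceeded 95%, and the model itself is free to use.

This post shares the full deployment process and the pitfalls I ran into, so you can get a local speech-to-text service running in about 30 minutes.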

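Whisper is an automatic speech recognition (ASR) model that OpenAI open-sourced in 2022. It was trained on 680,000 hours of multilingual data and supports transcription in 99 languages, plus translation of speech into English.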

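Key advantages:

• Strong multilingual support, with high accuracy for Chinese and English
• Open source and free, and local deployment keeps your audio private
• Flexible model sizes (tiny/base/small/medium/large)
• Automatic punctuation and timestamps (speaker identification is not built in and needs a separate diarization tool)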

Typical use cases:

• Transcribing meeting recordings
• Generating video subtitles
• Organizing podcast content
• Transcribing voice notes

Environment Setup

System Requirements

# Recommended configuration
- CPU: 4 cores or more
- RAM: 8GB+ (16GB+ for the large model)
- Disk: 10GB of free space
- OS: Windows/macOS/Linux

Installing Python

Whisper requires Python 3.8-3.11:

# macOS/Linux
python3 --version

# Windows
python --version

If Python is not installed yet, Anaconda is a convenient option:

# Create and activate a virtual environment
conda create -n whisper python=3.10
conda activate whisper

Installing FFmpeg

Whisper relies on FFmpeg to decode audio files:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html
# Unzip it and add it to the system PATH

Verify the installation:

ffmpeg -version
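The two checks above (Python version and FFmpeg on PATH) can also be done from Python, which is handy at the top of a script. This is a minimal sketch using only the standard library; the function name `check_environment` and the keys it returns are my own choices for illustration.

```python
import shutil
import sys


def check_environment():
    """Report whether the Python version and FFmpeg look usable for Whisper."""
    version = sys.version_info[:2]
    return {
        "python_version": f"{version[0]}.{version[1]}",
        # 3.8-3.11 is the version range this post recommends
        "python_in_range": (3, 8) <= version <= (3, 11),
        # shutil.which returns None when ffmpeg is not on PATH
        "ffmpeg_path": shutil.which("ffmpeg"),
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `ffmpeg_path` prints `None`, the shell that runs your script cannot see FFmpeg, even if it is installed somewhere on disk.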

Installing Whisper

Option 1: Install with pip (recommended)

pip install -U openai-whisper

Option 2: Install from source

pip install git+https://github.com/openai/whisper.git

Installing Acceleration Libraries (strongly recommended)

Whether you have an NVIDIA GPU on Windows/Linux or an Apple Silicon Mac, enabling hardware acceleration speeds things up considerably.

For Windows/Linux (NVIDIA GPU):

# CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

For macOS (Apple M1/M2/M3/M4/M5 chips):

Macs ship with MPS (Metal Performance Shaders) hardware acceleration, so a plain install is enough:

pip install torch torchvision torchaudio

Verify that hardware acceleration is available (run this directly in a terminal):

# Check CUDA (NVIDIA) or MPS (Mac)
python3 -c 'import torch; print("CUDA available:", torch.cuda.is_available()); print("MPS available:", torch.backends.mps.is_available())'

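Choosing a Model

Whisper comes in 5 sizes. Pick the one that fits your needs:

Model	Parameters	VRAM	Relative speed	English WER	Multilingual WER
tiny	39M	~1GB	~32x	5.0%	12.1%
base	74M	~1GB	~16x	3.4%	8.4%
small	244M	~2GB	~6x	2.3%	5.4%
medium	769M	~5GB	~2x	1.7%	3.8%
large/large-v3	1550M	~10GB	1x	1.4%	3.0%

The scripts in this post load large-v3-turbo, a speed-optimized variant of large-v3 that ships with recent openai-whisper releases and is not listed in the table.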

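Recommendations:

• Quick tests: tiny/base
• Everyday use: small (best trade-off between speed and accuracy)
• Professional work: medium/large (large-v3 in particular handles mixed Chinese and English and accented speech very well)

💡 Tip for Mac users: Apple M-series chips use a unified memory architecture, so as long as your Mac has more than 16GB of memory (for example an M5 with 24GB), the ~10GB large-v3 model runs comfortably. Combined with MPS acceleration, I strongly recommend going straight to large for the best accuracy.

On first run, the model is downloaded automatically to ~/.cache/whisper/.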
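To make the selection rule concrete, here is a small sketch that picks the largest model that fits in a given amount of memory. The memory figures are the approximate values from the table above; the `headroom_gb` margin and the function name are assumptions of mine, not part of Whisper.

```python
# (model name, approximate memory in GB) from the table above, smallest first
MODEL_MEMORY_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def pick_model(available_gb, headroom_gb=1.0):
    """Return the largest model whose footprint plus headroom fits in memory."""
    choice = None
    for name, needed_gb in MODEL_MEMORY_GB:
        if needed_gb + headroom_gb <= available_gb:
            choice = name
    return choice  # None means even tiny does not fit


print(pick_model(8))   # medium
print(pick_model(24))  # large-v3
```

Treat the result as a starting point only: real memory use also depends on precision (fp16 vs fp32) and on whatever else is running on the machine.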

Basic Usage

Command-Line Transcription

The simplest way to use it:

# Transcribe an audio file
whisper audio.mp3

# Specify the model
whisper audio.mp3 --model medium

# Specify the language (skips auto-detection)
whisper audio.mp3 --language Chinese

# Output a subtitle file
whisper audio.mp3 --output_format srt

# Translate into English
whisper audio.mp3 --task translate
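If you want to drive the CLI from a script, it helps to build the argument list in one place. The sketch below only uses the flags shown above (`--model`, `--language`, `--output_format`, `--task`); the helper name and its defaults are hypothetical.

```python
import shlex


def build_whisper_command(audio_path, model="small", language=None,
                          output_format=None, task="transcribe"):
    """Build the argument list for the `whisper` CLI from the flags shown above."""
    cmd = ["whisper", audio_path, "--model", model, "--task", task]
    if language:
        cmd += ["--language", language]
    if output_format:
        cmd += ["--output_format", output_format]
    return cmd


cmd = build_whisper_command("audio.mp3", model="medium",
                            language="Chinese", output_format="srt")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` instead of a single string avoids quoting problems with file names that contain spaces.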

Calling Whisper from Python

Create transcribe.py:

import whisper

# Load the model
model = whisper.load_model("large-v3-turbo")

# Transcribe the audio
result = model.transcribe("audio.mp3", language="zh")

# Print the full text
print(result["text"])

# Print the segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")

Run it:

python transcribe.py
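Transcribing is the slow step, so it can be worth saving the result and reusing it later. Here is a minimal sketch that stores the text and segment timings as JSON. The `sample` dict is a hand-written stand-in for what `model.transcribe()` returns (real results carry more keys), and the helper names are my own.

```python
import json
import tempfile
from pathlib import Path


def save_result(result, path):
    """Write the text and segment timings of a transcribe() result to JSON."""
    payload = {
        "text": result["text"],
        "segments": [
            {"start": s["start"], "end": s["end"], "text": s["text"]}
            for s in result["segments"]
        ],
    }
    # ensure_ascii=False keeps Chinese characters readable in the file
    Path(path).write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_result(path):
    """Read back a result written by save_result()."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Stand-in for the dict returned by model.transcribe()
sample = {
    "text": "你好，世界",
    "segments": [{"start": 0.0, "end": 1.5, "text": "你好，世界"}],
}

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "transcript.json"
    save_result(sample, out)
    print(load_result(out)["segments"][0]["end"])
```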

Batch Processing Script

To process multiple audio files:

import whisper
from pathlib import Path

model = whisper.load_model("large-v3-turbo")

# Input and output folders
audio_dir = Path("./audios")
output_dir = Path("./transcripts")
output_dir.mkdir(exist_ok=True)

# Supported formats
audio_formats = [".mp3", ".wav", ".m4a", ".flac"]

for audio_file in audio_dir.iterdir():
    if audio_file.suffix.lower() in audio_formats:
        print(f"Processing: {audio_file.name}")

        result = model.transcribe(str(audio_file), language="zh")

        # Save the text
        output_file = output_dir / f"{audio_file.stem}.txt"
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(result["text"])

        print(f"Done: {output_file.name}\n")
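With a large folder, a run that crashes halfway would otherwise start over from the first file. One way around that, sketched below, is to skip any audio file that already has a matching .txt transcript. The function name `find_pending` is hypothetical, and the folder layout is assumed to match the batch script.

```python
from pathlib import Path

AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac"}


def find_pending(audio_dir, output_dir):
    """Return audio files that do not yet have a matching .txt transcript."""
    audio_dir, output_dir = Path(audio_dir), Path(output_dir)
    pending = []
    for audio_file in sorted(audio_dir.iterdir()):
        if audio_file.suffix.lower() not in AUDIO_FORMATS:
            continue
        if (output_dir / f"{audio_file.stem}.txt").exists():
            continue  # already transcribed on an earlier run
        pending.append(audio_file)
    return pending
```

In the batch loop you would then iterate over `find_pending("./audios", "./transcripts")` instead of `audio_dir.iterdir()`.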

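Advanced Tips

1. Improving Accuracy

result = model.transcribe(
    "audio.mp3",
    language="zh",
    initial_prompt="这是一段关于人工智能技术的讨论",  # Context hint in the audio's language ("a discussion about AI technology")
    temperature=0.0,  # Deterministic decoding, no sampling randomness
    beam_size=5,  # Beam search width (applies when temperature is 0)
    best_of=5,  # Number of sampled candidates (only applies when temperature > 0)
    fp16=False  # Disable half precision when running on CPU
)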

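2. Handling Long Audio

Whisper processes audio in 30-second windows, so long recordings are split automatically:

# Options that matter for long audio (the two thresholds shown are the library defaults)
result = model.transcribe(
    "long_audio.mp3",
    verbose=True,  # Print segments as they are decoded
    condition_on_previous_text=True,  # Use earlier text for consistency; set to False if the output gets stuck repeating itself
    compression_ratio_threshold=2.4,  # Treat overly repetitive output as a failed decode
    logprob_threshold=-1.0  # Treat output with a low average log probability as a failed decode
)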
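The `compression_ratio_threshold` option is easier to reason about once you see what it measures. As far as I can tell from the Whisper source, the ratio is the UTF-8 size of the decoded text divided by its zlib-compressed size, so text that loops on the same phrase scores very high. The sketch below reproduces that idea with the standard library; `looks_degenerate` is a hypothetical helper, and the sample sentences are made up.

```python
import zlib


def compression_ratio(text):
    """Ratio of raw UTF-8 size to zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def looks_degenerate(text, threshold=2.4):
    """True when the text compresses better than the threshold allows."""
    return compression_ratio(text) > threshold


normal = "今天的会议主要讨论了下个季度的产品路线图和人员安排。"
looped = "谢谢大家" * 50  # the kind of loop a failed decode can produce

print(looks_degenerate(normal))
print(looks_degenerate(looped))
```

A normal sentence barely compresses, while the looped phrase compresses many times over, which is why 2.4 works as a cut-off for runaway repetition.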

3. Generating Subtitle Files

import whisper
from whisper.utils import get_writer

model = whisper.load_model("large-v3-turbo")
result = model.transcribe("video.mp4", language="zh")

# Write an SRT file; the writer names it after the input file, so this creates ./video.srt
writer = get_writer("srt", ".")
writer(result, "video.mp4", {"max_line_width": None, "max_line_count": None, "highlight_words": False})

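4. Real-Time Speech Recognition

Combine Whisper with PyAudio for near-real-time transcription. The script records 5-second chunks and transcribes each one in turn, so audio that arrives while a chunk is being transcribed may be dropped:

import os
import tempfile
import wave

import pyaudio
import whisper

model = whisper.load_model("large-v3-turbo")

# Recording parameters
CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
RECORD_SECONDS = 5

p = pyaudio.PyAudio()

stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=CHUNK
)

print("Recording... press Ctrl+C to stop")

try:
    while True:
        frames = []
        for _ in range(int(RATE / CHUNK * RECORD_SECONDS)):
            # Transcription blocks this loop, so tolerate input buffer overflows
            frames.append(stream.read(CHUNK, exception_on_overflow=False))

        # Reserve a temporary file name, then write the chunk as WAV and close it
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_audio:
            temp_path = temp_audio.name
        with wave.open(temp_path, "wb") as wf:
            wf.setnchannels(CHANNELS)
            wf.setsampwidth(p.get_sample_size(FORMAT))
            wf.setframerate(RATE)
            wf.writeframes(b"".join(frames))

        # Transcribe, then delete the temporary file
        result = model.transcribe(temp_path, language="zh")
        os.remove(temp_path)
        print(f"Result: {result['text']}")
except KeyboardInterrupt:
    print("Stopped")
finally:
    # Release the audio device
    stream.stop_stream()
    stream.close()
    p.terminate()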
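A loop like this also transcribes chunks in which nobody is speaking, which wastes compute and can yield made-up text. A simple mitigation, sketched here, is to measure the loudness of each chunk and skip the quiet ones. This swaps in a plain RMS energy check rather than a real voice activity detector, and the threshold of 500 is an assumption you would tune for your microphone.

```python
import array
import math


def rms_int16(pcm_bytes):
    """Root-mean-square amplitude of 16-bit PCM audio in native byte order."""
    samples = array.array("h")
    samples.frombytes(pcm_bytes)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(pcm_bytes, threshold=500.0):
    """Treat a chunk as silence when its RMS amplitude is below the threshold."""
    return rms_int16(pcm_bytes) < threshold


# Synthetic chunks standing in for b"".join(frames) from the recording loop
quiet = array.array("h", [0, 3, -2, 1] * 4000).tobytes()
loud = array.array("h", [8000, -8000] * 8000).tobytes()
print(is_silent(quiet), is_silent(loud))
```

In the recording loop, you would call `is_silent(b"".join(frames))` and `continue` before writing the temporary file.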

Performance Optimization

Hardware Acceleration (CUDA / Mac MPS)

Don't leave all the work to the CPU. Make sure you use your GPU or Apple Silicon acceleration:

import torch
import whisper

# Detect the available accelerator
if torch.cuda.is_available():
    device = "cuda"
    print(f"Using NVIDIA GPU: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = "mps"
    print("Using Apple M-series MPS acceleration")
else:
    device = "cpu"
    print("No hardware acceleration detected, using CPU (slower)")

# Load the model onto the selected device
model = whisper.load_model("large-v3-turbo", device=device)

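Quantization

faster-whisper, a reimplementation of Whisper on CTranslate2, can be up to 4x faster according to its README. Install it first:

pip install faster-whisper

Then use it from Python:

from faster_whisper import WhisperModel

# Use int8 quantization (set device="cpu" if you have no NVIDIA GPU)
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8")

# segments is a generator, so transcription runs as you iterate over it
segments, info = model.transcribe("audio.mp3", language="zh")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")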
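Which `compute_type` to pass depends on the device. The mapping below follows the examples in the faster-whisper README (float16 on GPU, int8_float16 for a quantized GPU run, int8 on CPU). The helper itself and its `low_memory` flag are hypothetical, so treat it as a sketch rather than a recommendation from the library.

```python
def pick_compute_type(device, low_memory=False):
    """Map a device name to a CTranslate2 compute type for faster-whisper."""
    if device == "cuda":
        # int8_float16 uses int8 weights with float16 for the remaining computation
        return "int8_float16" if low_memory else "float16"
    # On CPU, int8 is the quantized option used in the faster-whisper README
    return "int8"


for device in ("cuda", "cpu"):
    print(device, pick_compute_type(device))
```

The result can be passed straight through, for example `WhisperModel("large-v3-turbo", device=device, compute_type=pick_compute_type(device))`.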

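Batch Optimization

Use multiple processes to handle large batches of files. Each worker loads its own copy of the model, so lower the process count if memory is tight:

import multiprocessing as mp
import whisper

_model = None  # One model per worker process, loaded once

def init_worker():
    global _model
    _model = whisper.load_model("large-v3-turbo")

def transcribe_file(audio_path):
    result = _model.transcribe(audio_path, language="zh")
    return audio_path, result["text"]

if __name__ == "__main__":
    audio_files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]

    # "spawn" avoids CUDA initialization errors in forked workers
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4, initializer=init_worker) as pool:
        results = pool.map(transcribe_file, audio_files)

    for path, text in results:
        print(f"{path}: {text[:100]}...")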

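Common Problems

1. Inaccurate Chinese Recognition

Solutions:

• Specify the language explicitly: --language Chinese
• Use the medium or large model
• Provide context through initial_prompt
• Make sure the audio is clear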

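2. Out of Memory

Solutions:

• Use a smaller model (tiny/base)
• Keep fp16 half precision enabled on the GPU: model.transcribe(..., fp16=True). This is already the default, and it does not apply on CPU, where Whisper falls back to fp32
• Split long audio into shorter files and process them one at a time
• Use a quantized model through faster-whisper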

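3. Too Slow

Solutions:

• Make sure hardware acceleration is actually on: install the CUDA build of PyTorch on Windows, and pass device="mps" on a Mac
• Switch to faster-whisper
• Use a smaller model
• Specify the language explicitly to skip auto-detection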

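4. Inaccurate Timestamps

Solution:

result = model.transcribe(
    "audio.mp3",
    word_timestamps=True,  # Enable word-level timestamps
    prepend_punctuations="\"'“¿([{-",  # Punctuation merged with the following word
    append_punctuations="\"'.。,，!！?？:：”)]}、"  # Punctuation merged with the preceding word
)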
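With `word_timestamps=True`, each segment in the result gains a `words` list whose entries carry `word`, `start`, `end` and `probability`. One thing you can do with it, sketched below, is flag low-confidence words for manual review. The `sample` dict is hand-written in that shape, and the 0.5 cut-off is an arbitrary assumption.

```python
def low_confidence_words(result, min_probability=0.5):
    """Collect words whose probability falls below min_probability."""
    flagged = []
    for segment in result["segments"]:
        # "words" is only present when word_timestamps=True was passed
        for word in segment.get("words", []):
            if word["probability"] < min_probability:
                flagged.append((word["start"], word["end"], word["word"]))
    return flagged


# Hand-written stand-in for a transcribe(..., word_timestamps=True) result
sample = {
    "segments": [
        {
            "words": [
                {"word": "本地", "start": 0.0, "end": 0.4, "probability": 0.98},
                {"word": "部署", "start": 0.4, "end": 0.9, "probability": 0.31},
            ]
        }
    ]
}
for start, end, word in low_confidence_words(sample):
    print(f"{start:.2f}-{end:.2f}s  {word}")
```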

Web UI Deployment

Use Gradio to put together a web service quickly:

pip install gradio

Create app.py:

import gradio as gr
import whisper

model = whisper.load_model("large-v3-turbo")

def transcribe_audio(audio_file, language):
    # audio_file is the path of the uploaded file
    result = model.transcribe(audio_file, language=language)
    return result["text"]

demo = gr.Interface(
    fn=transcribe_audio,
    inputs=[
        gr.Audio(type="filepath", label="Upload audio"),
        gr.Dropdown(["zh", "en", "ja", "ko"], value="zh", label="Language")
    ],
    outputs=gr.Textbox(label="Transcript"),
    title="Whisper Speech-to-Text",
    description="Upload an audio file to convert it to text"
)

# 0.0.0.0 makes the app reachable from other machines and from outside a container
demo.launch(server_name="0.0.0.0", server_port=7860)

Run it:

python app.py

Then open http://localhost:7860 to use it.

Docker Deployment

Create a Dockerfile:

FROM python:3.10-slim

WORKDIR /app

RUN apt-get update && apt-get install -y ffmpeg git

RUN pip install openai-whisper gradio

COPY app.py .

EXPOSE 7860

CMD ["python", "app.py"]

Build and run:

docker build -t whisper-app .
docker run -p 7860:7860 whisper-app

Practical Examples

Example 1: Turning a Meeting Recording into Text

import whisper
from datetime import datetime

model = whisper.load_model("medium")

# Transcribe the meeting recording
result = model.transcribe(
    "meeting.m4a",
    language="zh",
    initial_prompt="这是一场关于产品规划的会议讨论",  # "a meeting about product planning"
    temperature=0.0
)

# Build the meeting notes
output = f"""
# Meeting Notes
Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}

## Full Transcript
{result['text']}

## Segments
"""

for i, segment in enumerate(result['segments'], 1):
    output += f"\n{i}. [{segment['start']:.0f}s] {segment['text']}"

with open("meeting_notes.md", "w", encoding="utf-8") as f:
    f.write(output)
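Listing every segment on its own line gets noisy for an hour-long meeting. One option, sketched below, is to merge consecutive segments into paragraphs and start a new paragraph whenever there is a long pause. The 2-second gap, the helper names, and the sample segments are all assumptions for illustration.

```python
def group_by_pause(segments, max_gap=2.0):
    """Merge consecutive segments into paragraphs, splitting on long pauses."""
    paragraphs = []
    for seg in segments:
        text = seg["text"].strip()
        if paragraphs and seg["start"] - paragraphs[-1]["end"] <= max_gap:
            # Chinese needs no separator; insert a space here for English text
            paragraphs[-1]["text"] += text
            paragraphs[-1]["end"] = seg["end"]
        else:
            paragraphs.append({"start": seg["start"], "end": seg["end"], "text": text})
    return paragraphs


def clock(seconds):
    """Format seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


# Made-up segments in the shape of result["segments"]
segments = [
    {"start": 0.0, "end": 4.0, "text": "先看一下上周的进度。"},
    {"start": 4.5, "end": 9.0, "text": "后端接口已经完成。"},
    {"start": 15.0, "end": 20.0, "text": "接下来讨论排期。"},
]
for p in group_by_pause(segments):
    print(f"[{clock(p['start'])}] {p['text']}")
```

The first two segments are half a second apart and end up in one paragraph, while the 6-second pause before the third starts a new one.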

Example 2: Generating Subtitles for a Batch of Videos

import whisper
from pathlib import Path

def format_timestamp(seconds):
    # Defined before the loop that calls it; rounds to the nearest millisecond
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

model = whisper.load_model("small")

video_dir = Path("./videos")

for video in video_dir.glob("*.mp4"):
    print(f"Processing: {video.name}")

    result = model.transcribe(
        str(video),
        language="zh",
        task="transcribe"
    )

    # Write the SRT subtitles next to the video
    srt_path = video.with_suffix(".srt")

    with open(srt_path, "w", encoding="utf-8") as f:
        for i, segment in enumerate(result['segments'], 1):
            start = format_timestamp(segment['start'])
            end = format_timestamp(segment['end'])
            text = segment['text'].strip()

            f.write(f"{i}\n")
            f.write(f"{start} --> {end}\n")
            f.write(f"{text}\n\n")

    print(f"Subtitles saved: {srt_path.name}\n")

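Summary

The complete workflow for deploying Whisper locally:

1. Environment setup: Python 3.8-3.11 + FFmpeg
2. Install Whisper: pip install openai-whisper
3. Choose a model: small is the best trade-off; use medium/large for professional work
4. Basic usage: the command line or a Python script
5. Performance: GPU acceleration + faster-whisper quantization
6. Web deployment: a quick Gradio interface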

Results from my own tests:

• Chinese accuracy: 95%+ (medium model)
• English accuracy: 98%+
• Speed: the small model takes about 10 seconds per minute of audio (on GPU)
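To turn the speed figure above into a planning number, divide the audio length by the processing time. The sketch below works through the 60-seconds-in-10-seconds case and extrapolates to a 2-hour recording; it assumes speed scales linearly with length, which is only a rough approximation, and the function names are my own.

```python
def speed_factor(audio_seconds, processing_seconds):
    """How many seconds of audio are transcribed per second of compute."""
    return audio_seconds / processing_seconds


def estimate_processing_seconds(audio_seconds, factor):
    """Estimated wall-clock time for a recording at a given speed factor."""
    return audio_seconds / factor


factor = speed_factor(60, 10)  # 1 minute of audio in 10 seconds
print(factor)  # 6.0
print(estimate_processing_seconds(2 * 3600, factor) / 60)  # 20.0 minutes for a 2-hour recording
```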

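Compared with commercial APIs, running Whisper locally has these advantages:

• Free to use, with no usage limits
• Private, because audio never leaves your machine
• Customizable, including fine-tuning

If you have a lot of audio to transcribe, Whisper is hard to beat.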

Resources:

• Whisper GitHub: https://github.com/openai/whisper
• faster-whisper: https://github.com/guillaumekln/faster-whisper
• Whisper paper: https://arxiv.org/abs/2212.04356

HYBG
