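AI Skill Hub strongly recommends WhisperX (time-aligned subtitles). It has more than 12.0k stars on GitHub and an AI composite score of 8.8, a solid showing among comparable tools. If you are looking for a reliable speech-to-subtitle tool, it is worth a closer look.
WhisperX is an enhanced version of OpenAI Whisper that adds word-level timestamp alignment and speaker diarization. It suits developers and content creators who need high-precision subtitle generation, podcast processing, or video captioning.
WhisperX is an open-source tool written in Python that focuses on speech recognition, subtitle generation, and speaker diarization. As a GitHub open-source project it has an active community and ongoing releases, its code is fully open to audit, and it can be deployed locally to keep data private. It fits both personal use and integration into enterprise workflows.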
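# Option 1: install with pip (recommended)
pip install whisperx
# Option 2: install into a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install whisperx
# Option 3: install from source (to get the latest features)
git clone https://github.com/m-bain/whisperX
cd whisperX
pip install -e .
# Verify the installation
python -c "import whisperx; print('installed successfully')"
# Command-line usage
whisperx --help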
# Basic usage: pass the audio file, and a directory (not a file) for the outputs
whisperx path/to/audio.wav --output_dir output_dir
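# Calling WhisperX from Python code
import whisperx
# Example: batched transcription, following the README's Python usage
model = whisperx.load_model("large-v2", "cuda", compute_type="float16")
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=16)
print(result["segments"])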
# Configuration: WhisperX is configured with command-line flags; upstream documents no config.yml, --config option, or WHISPERX_* environment variables
whisperx path/to/audio.wav --model base --batch_size 4 --compute_type int8 --output_dir ./output
To use WhisperX with GPU acceleration, install the CUDA toolkit 12.8 before WhisperX. Skip this step if using only the CPU.
- For Linux users, install the CUDA toolkit 12.8 following this guide: CUDA Installation Guide for Linux.
- For Windows users, download and install the CUDA toolkit 12.8: CUDA Downloads.
The easiest way to install WhisperX is through PyPI:
pip install whisperx
Or if using uvx:
uvx whisperx
These installation methods are for developers or users with specific needs. If you're not sure, stick with the simple installation above.
To install directly from the GitHub repository:
uvx git+https://github.com/m-bain/whisperX.git
If you want to modify the code or contribute to the project:
git clone https://github.com/m-bain/whisperX.git
cd whisperX
uv sync --all-extras --dev
Note: The development version may contain experimental features and bugs. Use the stable PyPI release for production environments.
You may also need to install ffmpeg, rust, etc. Follow OpenAI's setup instructions here: https://github.com/openai/whisper#setup.
```python
import whisperx
import gc
from whisperx.diarize import DiarizationPipeline

device = "cuda"
audio_file = "audio.mp3"
batch_size = 16  # reduce if low on GPU mem
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)
```
If you don't have access to your own GPUs, use the links above to try out WhisperX.
For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint paper.
To reduce GPU memory requirements, try any of the following (2. & 3. can affect quality):
1. Reduce the batch size, e.g. --batch_size 4
2. Use a smaller ASR model, e.g. --model base
3. Use a lighter compute type, e.g. --compute_type int8

Transcription differences from openai's whisper:

- Whisper inference is run with --without_timestamps True, which ensures 1 forward pass per sample in the batch. However, this can cause discrepancies with the default whisper output.
- --condition_on_prev_text is set to False by default (reduces hallucination)

If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good, send a pull request and some examples showing its success.
Bug finding and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.
Contact maxhbain@gmail.com for queries.
<a href="https://www.buymeacoffee.com/maxhbain" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>
This work, and my PhD, is supported by the VGG (Visual Geometry Group) and the University of Oxford.
Of course, this builds on OpenAI's whisper. It borrows important alignment code from the PyTorch tutorial on forced alignment, and uses the wonderful pyannote VAD / Diarization: https://github.com/pyannote/pyannote-audio
Valuable VAD & Diarization Models from:
Great backend from faster-whisper and CTranslate2
Those who have supported this work financially 🙏
Finally, thanks to the OS contributors of this project, keeping it going and identifying bugs.
@article{bain2022whisperx,
title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
journal={INTERSPEECH 2023},
year={2023}
}
If you’re looking for a transcription API for meetings, consider checking out Recall.ai's Meeting Transcription API, an API that works with Zoom, Google Meet, Microsoft Teams, and more. Recall.ai diarizes by pulling the speaker data and separate audio streams from the meeting platforms, which means 100% accurate speaker diarization with actual speaker names.
<p align="center"> <a href="https://github.com/m-bain/whisperX/stargazers"> <img src="https://img.shields.io/github/stars/m-bain/whisperX.svg?colorA=orange&colorB=orange&logo=github" alt="GitHub stars"> </a> <a href="https://github.com/m-bain/whisperX/issues"> <img src="https://img.shields.io/github/issues/m-bain/whisperx.svg" alt="GitHub issues"> </a> <a href="https://github.com/m-bain/whisperX/blob/master/LICENSE"> <img src="https://img.shields.io/github/license/m-bain/whisperX.svg" alt="GitHub license"> </a> <a href="https://arxiv.org/abs/2303.00747"> <img src="http://img.shields.io/badge/Arxiv-2303.00747-B31B1B.svg" alt="ArXiv paper"> </a> <a href="https://twitter.com/intent/tweet?text=&url=https%3A%2F%2Fgithub.com%2Fm-bain%2FwhisperX"> <img src="https://img.shields.io/twitter/url/https/github.com/m-bain/whisperX.svg?style=social" alt="Twitter"> </a> </p>
<img width="1216" align="center" alt="whisperx-arch" src="https://raw.githubusercontent.com/m-bain/whisperX/refs/heads/main/figures/pipeline.png">
This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
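To put the "70x realtime" figure in concrete terms, the arithmetic below estimates wall-clock time for a given audio length. The function name is illustrative, and a constant speed-up is an assumption; real throughput depends on hardware, model, and batch settings.

```python
def estimated_processing_seconds(audio_seconds: float, realtime_factor: float = 70.0) -> float:
    """Estimate wall-clock time, assuming a constant speed-up over real time."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


one_hour = 60 * 60  # seconds of audio
print(f"{estimated_processing_seconds(one_hour):.1f} s for one hour of audio")  # prints "51.4 s for one hour of audio"
```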
Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. Whilst it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's whisper also does not natively support batching.
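Per-word timestamps matter for subtitles because cues can then be cut at word boundaries instead of spanning a whole utterance. The sketch below groups timed words into short SRT cues. The input layout (a list of dicts with `word`, `start`, and `end` in seconds) and both function names are assumptions made for illustration, not WhisperX's own writer.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 4) -> str:
    """Group timed words into cues of at most max_words and render them as SRT."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        start, end = chunk[0]["start"], chunk[-1]["end"]
        index = len(cues) + 1
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


# Hypothetical word timings, in seconds.
words = [
    {"word": "Hello", "start": 0.5, "end": 0.75},
    {"word": "and", "start": 0.75, "end": 1.0},
    {"word": "welcome", "start": 1.0, "end": 1.5},
    {"word": "to", "start": 1.5, "end": 1.75},
    {"word": "WhisperX", "start": 1.75, "end": 2.5},
]
print(words_to_srt(words, max_words=3))
```

With utterance-level timestamps only, the same five words would have to share a single cue.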
Phoneme-Based ASR refers to a suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is wav2vec2.0.
Forced Alignment refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone-level segmentation.
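The idea behind forced alignment can be sketched as a small dynamic program: given per-frame scores for each token of a known transcript, find the best monotonic assignment of frames to tokens. This is a simplified illustration (no CTC blank token, hand-made probabilities), not the alignment code WhisperX uses; `forced_align` and the 20 ms frame duration are assumptions.

```python
import math
from typing import List, Tuple


def forced_align(log_probs: List[List[float]], num_tokens: int) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) per token for the best monotonic alignment.

    log_probs[t][j] is the log-probability that frame t belongs to transcript token j.
    Tokens appear in order and each token receives at least one frame.
    """
    num_frames = len(log_probs)
    if num_tokens == 0 or num_frames < num_tokens:
        raise ValueError("need at least one frame per token")
    neg_inf = float("-inf")
    # score[t][j]: best total log-probability with frame t assigned to token j.
    score = [[neg_inf] * num_tokens for _ in range(num_frames)]
    entered = [[False] * num_tokens for _ in range(num_frames)]  # True if token j starts at frame t
    score[0][0] = log_probs[0][0]
    for t in range(1, num_frames):
        for j in range(num_tokens):
            stay = score[t - 1][j]
            advance = score[t - 1][j - 1] if j > 0 else neg_inf
            best = max(stay, advance)
            if best == neg_inf:
                continue  # token j cannot be reached by frame t
            score[t][j] = best + log_probs[t][j]
            entered[t][j] = advance > stay
    # Backtrack from the last frame, which must belong to the last token.
    spans = [[0, 0] for _ in range(num_tokens)]
    j = num_tokens - 1
    spans[j][1] = num_frames - 1
    for t in range(num_frames - 1, 0, -1):
        if entered[t][j]:
            spans[j][0] = t
            j -= 1
            spans[j][1] = t - 1
    return [(first, last) for first, last in spans]


# Hypothetical per-frame probabilities for the three phones of "tap".
probs = [
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
log_probs = [[math.log(p) for p in row] for row in probs]
spans = forced_align(log_probs, num_tokens=3)
frame_s = 0.02  # assumed frame duration in seconds
for token, (first, last) in zip(["t", "a", "p"], spans):
    print(token, round(first * frame_s, 2), round((last + 1) * frame_s, 2))
```

Each frame span converts to a start and end time by multiplying by the frame duration, which is how phone-level (and, by grouping, word-level) timestamps fall out of the alignment.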
Voice Activity Detection (VAD) is the detection of the presence or absence of human speech.
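As a toy illustration of what VAD outputs, the sketch below marks frames as speech when their mean energy exceeds a threshold and merges consecutive speech frames into regions. Production systems such as the pyannote models credited in this README use trained neural networks instead; the function name, frame size, and threshold here are assumptions.

```python
from typing import List, Tuple


def detect_speech(samples: List[float], sample_rate: int, frame_ms: int = 20,
                  threshold: float = 0.01) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) regions whose mean frame energy exceeds threshold."""
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame must contain at least one sample")
    regions = []
    region_start = None
    num_frames = len(samples) // frame_len  # a trailing partial frame is ignored
    for i in range(num_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        t = i * frame_len / sample_rate
        if energy > threshold and region_start is None:
            region_start = t  # speech begins
        elif energy <= threshold and region_start is not None:
            regions.append((region_start, t))  # speech ends
            region_start = None
    if region_start is not None:
        regions.append((region_start, num_frames * frame_len / sample_rate))
    return regions


# Synthetic signal at 1 kHz: 0.1 s silence, 0.2 s of a loud square wave, 0.1 s silence.
signal = [0.0] * 100 + [0.5 if n % 2 == 0 else -0.5 for n in range(200)] + [0.0] * 100
print(detect_speech(signal, sample_rate=1000))  # prints [(0.1, 0.3)]
```

Regions like these are what let a pipeline skip silence and cut long audio into chunks before transcription.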
Speaker Diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.
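Once diarization has produced speaker turns, they can be combined with word timestamps by giving each word the speaker whose turn overlaps it most. The sketch below shows that general idea under assumed data layouts (turns as `(start, end, speaker)` tuples, words as dicts); the function names and speaker labels are illustrative, and this is not presented as WhisperX's own implementation.

```python
from typing import Dict, List, Optional, Tuple


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Dict], turns: List[Tuple[float, float, str]]) -> List[Dict]:
    """Label each word with the speaker whose turn overlaps it the most."""
    labelled = []
    for word in words:
        best_speaker: Optional[str] = None
        best_overlap = 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(word["start"], word["end"], turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        # Words that overlap no turn keep speaker=None.
        labelled.append({**word, "speaker": best_speaker})
    return labelled


# Hypothetical diarization turns and word timings, in seconds.
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.0, "SPEAKER_01")]
words = [
    {"word": "hi", "start": 0.5, "end": 1.0},
    {"word": "there", "start": 1.75, "end": 2.5},
    {"word": "ok", "start": 5.0, "end": 5.5},
]
for item in assign_speakers(words, turns):
    print(item["word"], item["speaker"])
```

The word "there" straddles the turn boundary and goes to the second speaker because 0.5 s of it falls in that turn versus 0.25 s in the first.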
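aiskill88 review: a mature speech-processing enhancement tool whose 12K stars reflect broad recognition. Its word-level timestamps and speaker diarization are among the strongest available, the codebase is actively maintained, and it is suitable for production use.
This tool is released under the BSD-4-Clause license. For commercial use, read the license terms carefully and seek legal advice where necessary.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and is not a legal endorsement of the tool's functionality or quality.
We recommend validating the tool thoroughly in a sandbox or test environment before deploying it to production, and carrying out the necessary security assessment.
📄 BSD-4-Clause: see the original license text for the specific usage restrictions.
Overall, WhisperX is a high-quality AI tool that is competitive among comparable tools. AI Skill Hub will keep tracking its updates; bookmark it and adopt it when it fits your own use case.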
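| Original name | WhisperX |
| Original description | An enhanced Whisper with timestamp alignment: word-level subtitle timing and speaker diarization |
| Topics | speech recognition, subtitle generation, speaker diarization, timestamp alignment, multilingual transcription |
| GitHub | https://github.com/m-bain/whisperX |
| License | BSD-4-Clause |
| Language | Python |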
Listed: 2026-05-13 · Updated: 2026-05-26 · License: BSD-4-Clause · AI Skill Hub does not legally endorse the accuracy of third-party content.