A small Python CLI that transcribes audio using Vosk.
If you pass an .mp3, it automatically converts to a 16 kHz mono WAV via ffmpeg and then transcribes. If you pass a .wav, it transcribes directly.
- Input: `.mp3` or `.wav`
- Output: `.txt` with the same base name as the input (e.g., `audio.mp3` → `audio.txt`)
- Engine: Vosk offline speech recognition
- 🧠 Offline (no network calls once the model is downloaded)
- 🔁 MP3 → WAV conversion (16 kHz mono) via `ffmpeg`
- 📝 Writes a clean transcript to `<input>.txt`
- ⚠️ Warns if the WAV isn’t 16 kHz mono (still works, but quality may drop)
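The 16 kHz mono check mentioned above can be done with the standard-library `wave` module. A minimal sketch (the helper name `check_wav_format` is illustrative, not from the script):

```python
import wave

def check_wav_format(path: str) -> bool:
    """Return True if the WAV is 16 kHz mono 16-bit PCM; warn otherwise."""
    with wave.open(path, "rb") as wf:
        ok = (wf.getframerate() == 16000
              and wf.getnchannels() == 1
              and wf.getsampwidth() == 2)   # 2 bytes = 16-bit samples
    if not ok:
        print("WARNING: expected 16 kHz mono 16-bit PCM; "
              "recognition quality may drop.")
    return ok
```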
# 1) Create and activate a Python venv (recommended)
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
# 2) Install dependencies
pip install vosk
# 3) Install ffmpeg (required for mp3 inputs)
# macOS (Homebrew)
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y ffmpeg
# Windows (Chocolatey, admin PowerShell)
choco install ffmpeg -y
# 4) Download a Vosk model (see "Models" below), e.g.:
# vosk-model-en-us-0.42-gigaspeech
# Extract it somewhere (e.g., ./vosk-model-en-us-0.42-gigaspeech)
# 5) Run
python transcribe.py path/to/audio.mp3 --model ./vosk-model-en-us-0.42-gigaspeech
# or
python transcribe.py path/to/audio.wav --model ./vosk-model-en-us-0.42-gigaspeech
# Output:
# path/to/audio.txt

Get prebuilt speech models from the official Vosk site:
- English and many other languages: https://alphacephei.com/vosk/models
- Each model page shows size, expected quality, and CPU/GPU requirements.
Tips
- Start with a small/medium model to validate your setup.
- For better accuracy, try `vosk-model-en-us-0.42-gigaspeech` or newer.
- Large models improve accuracy but require more RAM and disk.
How `transcribe.py` works (summarized):
- Uses `argparse` to accept:
  - `input_path`: `.mp3` or `.wav`
  - `--model`: path to the Vosk model directory
  - `--out-wav` (optional): override the intermediate WAV path for `.mp3` inputs
- If the input ends with `.mp3`:
  - Runs `ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav`
  - Transcribes the WAV with Vosk
- If the input ends with `.wav`:
  - Transcribes directly (warns if not 16 kHz mono)
- Writes the final transcript to `<original-input>.txt`
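The Vosk side of the steps above can be sketched as follows. `Model`, `KaldiRecognizer`, `AcceptWaveform`, `Result`, and `FinalResult` are the actual Vosk Python API; the function and helper names here are illustrative:

```python
import json
import wave

def join_results(json_results):
    """Join the 'text' fields of Vosk JSON result strings, skipping empties."""
    texts = (json.loads(r).get("text", "") for r in json_results)
    return " ".join(t for t in texts if t)

def transcribe_wav(wav_path, model_dir):
    """Feed a WAV file to Vosk in 4000-frame chunks and return the transcript."""
    # Imported here so the helper above works even without vosk installed.
    from vosk import Model, KaldiRecognizer

    model = Model(model_dir)
    results = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(model, wf.getframerate())
        while True:
            data = wf.readframes(4000)  # chunk size; see the tips section
            if not data:
                break
            if rec.AcceptWaveform(data):
                results.append(rec.Result())      # finalized segment (JSON)
        results.append(rec.FinalResult())          # flush the last partial
    return join_results(results)
```

Each `Result()`/`FinalResult()` call returns a JSON string with a `"text"` field, which is why the transcript is assembled by parsing and joining them.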
# Transcribe an MP3 (auto-converts to WAV first)
python transcribe.py FDR_2906_Climate_Change_WNOWTA.mp3 \
--model ./vosk-model-en-us-0.42-gigaspeech
# Transcribe a WAV directly
python transcribe.py interview.wav \
--model ./vosk-model-en-us-0.42-gigaspeech
# Choose a custom intermediate WAV path for mp3 inputs
python transcribe.py audio.mp3 \
--model ./vosk-model-en-us-0.42-gigaspeech \
--out-wav ./tmp/working.wav

Output:
- `FDR_2906_Climate_Change_WNOWTA.txt` (for the first command)
- `interview.txt` (for the second)
- Install Homebrew if you don’t have it: https://brew.sh
- `brew install ffmpeg`
- Python 3.10+ recommended. Use the system Python or `pyenv`.
sudo apt-get update
sudo apt-get install -y ffmpeg python3 python3-venv

- Install Python from the Microsoft Store or python.org.
- Install ffmpeg via Chocolatey: `choco install ffmpeg -y`
- Ensure `ffmpeg.exe` is in `PATH` (`where ffmpeg` should work in CMD/PowerShell).
- Best input: 16 kHz, mono, 16-bit PCM WAV. The script enforces this when you start from MP3.
- Noisy audio? Consider pre-denoising with `ffmpeg` filters or specialized utilities.
- Model choice drives accuracy more than anything else. Try progressively larger models if your machine can handle them.
- Chunk size: the script reads 4000 frames per loop; you can tweak this for throughput.
- `ERROR: ffmpeg not found in PATH`: install ffmpeg and ensure it’s on your PATH (`ffmpeg -version` should work).
- `ERROR: Vosk model not found`: check that the `--model` path points to the extracted model directory (contains `am`, `conf`, etc.).
- Garbage output / wrong language: you’re probably using a model for a different language, or one that’s too small. Switch to the correct or larger model.
- Very slow or crashes: use a smaller model, close memory-hungry apps, and ensure you’re on 64-bit Python.
Vosk expects mono audio and works very well at 16 kHz. Converting keeps the recognizer happy and avoids channel-mixing artifacts.
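If you’d rather shell out to `ffmpeg` from Python than run the conversion by hand, here is a sketch; the flags match the command shown earlier, and the function names are illustrative:

```python
import shutil
import subprocess
import sys

def ffmpeg_cmd(mp3_path, wav_path):
    """Build the ffmpeg command for a 16 kHz mono WAV (-y overwrites output)."""
    return ["ffmpeg", "-y", "-i", mp3_path, "-ar", "16000", "-ac", "1", wav_path]

def mp3_to_wav16k(mp3_path, wav_path):
    """Convert an MP3 to a 16 kHz mono WAV, failing early if ffmpeg is missing."""
    if shutil.which("ffmpeg") is None:
        sys.exit("ERROR: ffmpeg not found in PATH")
    subprocess.run(ffmpeg_cmd(mp3_path, wav_path), check=True)
```

`check=True` makes a failed conversion raise immediately instead of silently handing Vosk a missing or truncated WAV.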
- Pin `vosk` in `requirements.txt` if you’re distributing:
  vosk==0.3.45  # example; use the latest that works for you
- If you want live mic input later, look at `sounddevice` + Vosk streaming examples in the Vosk repo.
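A rough sketch of what that could look like, modeled on the microphone example in the Vosk repo; it assumes `sounddevice` is installed (`pip install sounddevice`) and an input device that supports 16 kHz:

```python
import json
import queue

def make_callback(audio_queue):
    """sounddevice callback that copies each raw audio block into a queue."""
    def callback(indata, frames, time, status):
        audio_queue.put(bytes(indata))
    return callback

def live_transcribe(model_dir, samplerate=16000):
    """Stream microphone audio into Vosk and print each finalized utterance."""
    # Optional dependencies imported here so the module loads without them.
    import sounddevice as sd
    from vosk import Model, KaldiRecognizer

    audio = queue.Queue()
    rec = KaldiRecognizer(Model(model_dir), samplerate)
    with sd.RawInputStream(samplerate=samplerate, blocksize=8000,
                           dtype="int16", channels=1,
                           callback=make_callback(audio)):
        while True:  # Ctrl+C to stop
            if rec.AcceptWaveform(audio.get()):
                print(json.loads(rec.Result())["text"])
```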
- Vosk API (GitHub): https://github.com/alphacep/vosk-api
- Vosk Models: https://alphacephei.com/vosk/models
- FFmpeg: https://ffmpeg.org/
This script is yours; Vosk and the models have their own licenses—check each model’s page and the Vosk repo for details.