@basperheim
Last active October 13, 2025 14:44
Use Python/Vosk to Create Transcripts from Audio Files

Vosk Transcriber (MP3/WAV → Text)

A small Python CLI that transcribes audio using Vosk. If you pass an .mp3, it automatically converts to a 16 kHz mono WAV via ffmpeg and then transcribes. If you pass a .wav, it transcribes directly.


Features

  • 🧠 Offline (no network calls once the model is downloaded)
  • 🔁 MP3 → WAV conversion (16 kHz mono) via ffmpeg
  • 📝 Writes a clean transcript to <input>.txt
  • ⚠️ Warns if the WAV isn’t 16 kHz mono (still works, but quality may drop)

Quick Start

# 1) Create and activate a Python venv (recommended)
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

# 2) Install dependencies
pip install vosk

# 3) Install ffmpeg (required for mp3 inputs)
# macOS (Homebrew)
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y ffmpeg
# Windows (Chocolatey, admin PowerShell)
choco install ffmpeg -y

# 4) Download a Vosk model (see "Models" below), e.g.:
#   vosk-model-en-us-0.42-gigaspeech
#   Extract it somewhere (e.g., ./vosk-model-en-us-0.42-gigaspeech)

# 5) Run
python transcribe.py path/to/audio.mp3 --model ./vosk-model-en-us-0.42-gigaspeech
# or
python transcribe.py path/to/audio.wav --model ./vosk-model-en-us-0.42-gigaspeech

# Output:
#   path/to/audio.txt

Models (Download)

Get prebuilt speech models from the official Vosk site: https://alphacephei.com/vosk/models

Tips

  • Start with a small/medium model to validate your setup.
  • For better accuracy, try vosk-model-en-us-0.42-gigaspeech or newer.
  • Large models improve accuracy but require more RAM and disk.

The Script

What transcribe.py does (full listing at the end of this document):

  • Uses argparse to accept:

    • input_path: .mp3 or .wav
    • --model: path to the Vosk model directory
    • --out-wav (optional): override the intermediate WAV path for .mp3 inputs
  • If input ends with .mp3:

    • Runs ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
    • Transcribes the WAV with Vosk
  • If input ends with .wav:

    • Transcribes directly (warns if not 16 kHz mono)
  • Writes the final transcript to <original-input>.txt
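The suffix-based dispatch described above can be sketched in a few lines; `pick_wav` is a hypothetical helper written for illustration (the real script's conversion step is stubbed out here):

```python
from pathlib import Path

def pick_wav(input_path: Path) -> Path:
    """Decide which WAV file to transcribe, based on the input suffix.
    The actual MP3 -> WAV conversion (ffmpeg) is stubbed out here."""
    ext = input_path.suffix.lower()
    if ext == ".mp3":
        # Full script: convert_mp3_to_wav(input_path) runs ffmpeg first
        return input_path.with_suffix(".wav")
    if ext == ".wav":
        return input_path
    raise ValueError("Only .mp3 and .wav inputs are supported.")

print(pick_wav(Path("talk.mp3")))  # talk.wav
```

Note that the transcript name is always derived from the original input (`with_suffix(".txt")`), not from the intermediate WAV.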


Usage

# Transcribe an MP3 (auto-converts to WAV first)
python transcribe.py FDR_2906_Climate_Change_WNOWTA.mp3 \
  --model ./vosk-model-en-us-0.42-gigaspeech

# Transcribe a WAV directly
python transcribe.py interview.wav \
  --model ./vosk-model-en-us-0.42-gigaspeech

# Choose a custom intermediate WAV path for mp3 inputs
python transcribe.py audio.mp3 \
  --model ./vosk-model-en-us-0.42-gigaspeech \
  --out-wav ./tmp/working.wav

Output

  • FDR_2906_Climate_Change_WNOWTA.txt (for the first command)
  • interview.txt (for the second)

Platform Notes

macOS

  • Install Homebrew if you don’t have it: https://brew.sh
  • brew install ffmpeg
  • Python 3.10+ recommended. Use the system Python or pyenv.

Linux (Debian/Ubuntu)

sudo apt-get update
sudo apt-get install -y ffmpeg python3 python3-venv

Windows

  • Install Python from the Microsoft Store or python.org.
  • Install ffmpeg via Chocolatey:

choco install ffmpeg -y

  • Ensure ffmpeg.exe is on your PATH (running where ffmpeg in CMD/PowerShell should print its location).

Accuracy & Performance

  • Best input: 16 kHz, mono, 16-bit PCM WAV. The script enforces this when you start from MP3.
  • Noisy audio? Consider pre-denoising with tools like ffmpeg filters or specialized utilities.
  • Model choice drives accuracy more than anything else. Try progressively larger models if your machine can handle them.
  • Chunk size: The script reads 4000 frames per loop; you can tweak for throughput.
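As a sketch of the pre-denoising idea above, a helper could assemble an ffmpeg command that high-passes the audio and applies the afftdn FFT denoiser before resampling. The helper name, cutoff frequency, and filter chain are illustrative assumptions, not part of the original script:

```python
from pathlib import Path
from typing import List

def build_denoise_cmd(src: Path, dst: Path) -> List[str]:
    """Assemble (but do not run) an ffmpeg command that cuts low-frequency
    rumble (highpass at 100 Hz), applies the afftdn FFT denoiser, and
    resamples to the 16 kHz mono format the transcriber expects."""
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-af", "highpass=f=100,afftdn",  # filter chain: rumble cut, then denoise
        "-ar", "16000",                  # 16 kHz
        "-ac", "1",                      # mono
        str(dst),
    ]

# Example: hand the list to subprocess.run(cmd, check=True) once ffmpeg is installed.
cmd = build_denoise_cmd(Path("noisy.mp3"), Path("clean.wav"))
print(" ".join(cmd))
```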

Troubleshooting

  • "ERROR: ffmpeg not found in PATH": install ffmpeg and ensure it’s on your PATH (ffmpeg -version should work).

  • "ERROR: Vosk model not found": check that --model points to the extracted model directory (it should contain am, conf, etc.).

  • Garbage output / wrong language: you’re probably using a model for a different language, or one that’s too small. Switch to the correct or larger model.

  • Very slow or crashes: use a smaller model, close memory-hungry apps, and make sure you’re on 64-bit Python.
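For the model-path error, a small sanity check can confirm that the --model directory looks like an extracted Vosk model before loading it. `looks_like_vosk_model` is a hypothetical helper; the subdirectory names it checks (am, conf) are the ones found in extracted Vosk models:

```python
from pathlib import Path

def looks_like_vosk_model(model_dir: Path) -> bool:
    """Heuristic check: an extracted Vosk model directory contains
    subdirectories such as 'am' and 'conf'."""
    return (
        model_dir.is_dir()
        and (model_dir / "am").is_dir()
        and (model_dir / "conf").is_dir()
    )

# Example: warn early instead of letting vosk.Model() fail later.
if not looks_like_vosk_model(Path("./vosk-model-en-us-0.42-gigaspeech")):
    print("Model directory missing or incomplete; re-check --model.")
```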


Why 16 kHz Mono?

Vosk expects mono audio and works very well at 16 kHz. Converting keeps the recognizer happy and avoids channel-mixing artifacts.
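To check whether a WAV already matches this format, Python's standard wave module can read the header directly. A minimal sketch (the function name and file path are placeholders):

```python
import wave

def wav_format(path: str) -> tuple:
    """Return (sample_rate_hz, channels, bits_per_sample) for a WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate(), wf.getnchannels(), wf.getsampwidth() * 8

# A file is ready for Vosk when this returns (16000, 1, 16).
```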


Development Tips

  • Pin vosk in requirements.txt if you’re distributing:
vosk==0.3.45   # example; use the latest that works for you
  • If you want live mic input later, look at sounddevice + Vosk streaming examples in the Vosk repo.

License

This script is yours; Vosk and the models have their own licenses—check each model’s page and the Vosk repo for details.

transcribe.py (full listing)

#!/usr/bin/env python3
import argparse
import json
import shutil
import subprocess
import sys
import wave
from pathlib import Path
from typing import List, Optional

import vosk


def ensure_ffmpeg() -> None:
    if shutil.which("ffmpeg") is None:
        sys.stderr.write("ERROR: ffmpeg not found in PATH. Install ffmpeg first.\n")
        sys.exit(1)


def convert_mp3_to_wav(mp3_path: Path, out_wav: Optional[Path] = None) -> Path:
    ensure_ffmpeg()
    if out_wav is None:
        out_wav = mp3_path.with_suffix(".wav")
    cmd = [
        "ffmpeg",
        "-y",            # overwrite if exists
        "-i", str(mp3_path),
        "-ar", "16000",  # resample to 16 kHz
        "-ac", "1",      # mono
        str(out_wav),
    ]
    try:
        subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        sys.stderr.write(f"ERROR: ffmpeg failed converting '{mp3_path.name}'.\n")
        sys.stderr.write(e.stderr.decode(errors="ignore"))
        sys.exit(1)
    return out_wav


def open_wave_for_read(path: Path) -> wave.Wave_read:
    try:
        return wave.open(str(path), "rb")
    except wave.Error as e:
        sys.stderr.write(f"ERROR: Unable to read WAV file '{path}': {e}\n")
        sys.exit(1)


def _extract_text_from_result(json_str: str) -> str:
    """
    Parse Vosk recognizer Result()/FinalResult() JSON and return the 'text' field (or empty string).
    """
    try:
        obj = json.loads(json_str)
        return obj.get("text", "").strip()
    except json.JSONDecodeError:
        # Fallback: return raw if JSON failed (shouldn't happen with Vosk)
        return json_str.strip()


def transcript_from_wav(wav_path: Path, model_path: Path) -> List[str]:
    """
    Transcribe a WAV file and return a list of finalized text segments.
    """
    if not model_path.exists():
        sys.stderr.write(f"ERROR: Vosk model not found at '{model_path}'.\n")
        sys.exit(1)
    segments: List[str] = []
    wf = open_wave_for_read(wav_path)
    with wf:
        n_channels = wf.getnchannels()
        sample_width = wf.getsampwidth()
        framerate = wf.getframerate()
        if n_channels != 1 or framerate != 16000:
            sys.stderr.write(
                f"WARNING: WAV is {framerate} Hz, {n_channels} channel(s). "
                "Vosk works best with 16 kHz mono PCM.\n"
            )
        if sample_width != 2:
            sys.stderr.write(
                f"WARNING: Sample width is {sample_width * 8} bits. "
                "16-bit PCM is recommended.\n"
            )
        model = vosk.Model(str(model_path))
        recognizer = vosk.KaldiRecognizer(model, framerate)
        # Stream frames; only collect finalized segments
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            if recognizer.AcceptWaveform(data):
                seg_text = _extract_text_from_result(recognizer.Result())
                if seg_text:
                    segments.append(seg_text)
        # Final tail
        final_text = _extract_text_from_result(recognizer.FinalResult())
        if final_text:
            segments.append(final_text)
    return segments


def write_transcript(output_txt: Path, segments: List[str]) -> None:
    output_txt.write_text(" ".join(segments).strip() + ("\n" if segments else ""), encoding="utf-8")


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Transcribe audio using Vosk. Converts MP3 to 16 kHz mono WAV via ffmpeg if needed, and writes a .txt transcript."
    )
    parser.add_argument(
        "input_path",
        type=Path,
        help="Path to input .mp3 or .wav file",
    )
    parser.add_argument(
        "--model",
        type=Path,
        default=Path("./vosk-model-en-us-0.42-gigaspeech"),
        help="Path to Vosk model directory (default: ./vosk-model-en-us-0.42-gigaspeech)",
    )
    parser.add_argument(
        "--out-wav",
        type=Path,
        default=None,
        help="Optional explicit output WAV path when converting from MP3",
    )
    args = parser.parse_args()

    in_path: Path = args.input_path
    model_path: Path = args.model
    if not in_path.exists():
        sys.stderr.write(f"ERROR: Input file not found: {in_path}\n")
        sys.exit(1)

    # Always base the .txt name on the ORIGINAL input file
    out_txt: Path = in_path.with_suffix(".txt")

    ext = in_path.suffix.lower()
    if ext == ".mp3":
        wav_path = convert_mp3_to_wav(in_path, args.out_wav)
        segments = transcript_from_wav(wav_path, model_path)
    elif ext == ".wav":
        segments = transcript_from_wav(in_path, model_path)
    else:
        sys.stderr.write("ERROR: Only .mp3 and .wav inputs are supported.\n")
        sys.exit(1)

    write_transcript(out_txt, segments)
    sys.stderr.write(f"Transcript written to: {out_txt}\n")


if __name__ == "__main__":
    main()