@basperheim
Last active October 13, 2025 14:44
Use Python/Vosk to Create Transcripts from Audio Files

Vosk Transcriber (MP3/WAV → Text)

A small Python CLI that transcribes audio using Vosk. If you pass an .mp3, it automatically converts to a 16 kHz mono WAV via ffmpeg and then transcribes. If you pass a .wav, it transcribes directly.


Features

  • 🧠 Offline (no network calls once the model is downloaded)
  • 🔁 MP3 → WAV conversion (16 kHz mono) via ffmpeg
  • 📝 Writes a clean transcript to <input>.txt
  • ⚠️ Warns if the WAV isn’t 16 kHz mono (still works, but quality may drop)

Quick Start

# 1) Create and activate a Python venv (recommended)
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

# 2) Install dependencies
pip install vosk

# 3) Install ffmpeg (required for mp3 inputs)
# macOS (Homebrew)
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y ffmpeg
# Windows (Chocolatey, admin PowerShell)
choco install ffmpeg -y

# 4) Download a Vosk model (see "Models" below), e.g.:
#   vosk-model-en-us-0.42-gigaspeech
#   Extract it somewhere (e.g., ./vosk-model-en-us-0.42-gigaspeech)

# 5) Run
python transcribe.py path/to/audio.mp3 --model ./vosk-model-en-us-0.42-gigaspeech
# or
python transcribe.py path/to/audio.wav --model ./vosk-model-en-us-0.42-gigaspeech

# Output:
#   path/to/audio.txt

Models (Download)

Get prebuilt speech models from the official Vosk site: https://alphacephei.com/vosk/models

Tips

  • Start with a small/medium model to validate your setup.
  • For better accuracy, try vosk-model-en-us-0.42-gigaspeech or newer.
  • Large models improve accuracy but require more RAM and disk.

The Script

What transcribe.py does (full listing at the end of this document):

  • Uses argparse to accept:

    • input_path: .mp3 or .wav
    • --model: path to the Vosk model directory
    • --out-wav (optional): override the intermediate WAV path for .mp3 inputs
  • If input ends with .mp3:

    • Runs ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
    • Transcribes the WAV with Vosk
  • If input ends with .wav:

    • Transcribes directly (warns if not 16 kHz mono)
  • Writes the final transcript to <original-input>.txt
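The suffix-based dispatch described above can be sketched in a few lines; `pick_wav` is a hypothetical helper written for illustration (the real script's conversion step is stubbed out here):

```python
from pathlib import Path

def pick_wav(input_path: Path) -> Path:
    """Decide which WAV file to transcribe, based on the input suffix.
    The actual MP3 -> WAV conversion (ffmpeg) is stubbed out here."""
    ext = input_path.suffix.lower()
    if ext == ".mp3":
        # Full script: convert_mp3_to_wav(input_path) runs ffmpeg first
        return input_path.with_suffix(".wav")
    if ext == ".wav":
        return input_path
    raise ValueError("Only .mp3 and .wav inputs are supported.")

print(pick_wav(Path("talk.mp3")))  # talk.wav
```

Note that the transcript name is always derived from the original input (`with_suffix(".txt")`), not from the intermediate WAV.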


Usage

# Transcribe an MP3 (auto-converts to WAV first)
python transcribe.py FDR_2906_Climate_Change_WNOWTA.mp3 \
  --model ./vosk-model-en-us-0.42-gigaspeech

# Transcribe a WAV directly
python transcribe.py interview.wav \
  --model ./vosk-model-en-us-0.42-gigaspeech

# Choose a custom intermediate WAV path for mp3 inputs
python transcribe.py audio.mp3 \
  --model ./vosk-model-en-us-0.42-gigaspeech \
  --out-wav ./tmp/working.wav

Output

  • FDR_2906_Climate_Change_WNOWTA.txt (for the first command)
  • interview.txt (for the second)

Platform Notes

macOS

  • Install Homebrew if you don’t have it: https://brew.sh
  • brew install ffmpeg
  • Python 3.10+ recommended. Use the system Python or pyenv.

Linux (Debian/Ubuntu)

sudo apt-get update
sudo apt-get install -y ffmpeg python3 python3-venv

Windows

  • Install Python from the Microsoft Store or python.org.
  • Install ffmpeg via Chocolatey:

choco install ffmpeg -y

  • Ensure ffmpeg.exe is on your PATH (running where ffmpeg in CMD/PowerShell should print its location).

Accuracy & Performance

  • Best input: 16 kHz, mono, 16-bit PCM WAV. The script enforces this when you start from MP3.
  • Noisy audio? Consider pre-denoising with tools like ffmpeg filters or specialized utilities.
  • Model choice drives accuracy more than anything else. Try progressively larger models if your machine can handle them.
  • Chunk size: The script reads 4000 frames per loop; you can tweak for throughput.
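As a sketch of the pre-denoising idea above, a helper could assemble an ffmpeg command that high-passes the audio and applies the afftdn FFT denoiser before resampling. The helper name, cutoff frequency, and filter chain are illustrative assumptions, not part of the original script:

```python
from pathlib import Path
from typing import List

def build_denoise_cmd(src: Path, dst: Path) -> List[str]:
    """Assemble (but do not run) an ffmpeg command that cuts low-frequency
    rumble (highpass at 100 Hz), applies the afftdn FFT denoiser, and
    resamples to the 16 kHz mono format the transcriber expects."""
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-af", "highpass=f=100,afftdn",  # filter chain: rumble cut, then denoise
        "-ar", "16000",                  # 16 kHz
        "-ac", "1",                      # mono
        str(dst),
    ]

# Example: hand the list to subprocess.run(cmd, check=True) once ffmpeg is installed.
cmd = build_denoise_cmd(Path("noisy.mp3"), Path("clean.wav"))
print(" ".join(cmd))
```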

Troubleshooting

  • "ERROR: ffmpeg not found in PATH": install ffmpeg and ensure it’s on your PATH (ffmpeg -version should work).

  • "ERROR: Vosk model not found": check that --model points to the extracted model directory (it should contain am, conf, etc.).

  • Garbage output / wrong language: you’re probably using a model for a different language, or one that’s too small. Switch to the correct or larger model.

  • Very slow or crashes: use a smaller model, close memory-hungry apps, and make sure you’re on 64-bit Python.
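For the model-path error, a small sanity check can confirm that the --model directory looks like an extracted Vosk model before loading it. `looks_like_vosk_model` is a hypothetical helper; the subdirectory names it checks (am, conf) are the ones found in extracted Vosk models:

```python
from pathlib import Path

def looks_like_vosk_model(model_dir: Path) -> bool:
    """Heuristic check: an extracted Vosk model directory contains
    subdirectories such as 'am' and 'conf'."""
    return (
        model_dir.is_dir()
        and (model_dir / "am").is_dir()
        and (model_dir / "conf").is_dir()
    )

# Example: warn early instead of letting vosk.Model() fail later.
if not looks_like_vosk_model(Path("./vosk-model-en-us-0.42-gigaspeech")):
    print("Model directory missing or incomplete; re-check --model.")
```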


Why 16 kHz Mono?

Vosk expects mono audio and works very well at 16 kHz. Converting keeps the recognizer happy and avoids channel-mixing artifacts.
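To check whether a WAV already matches this format, Python's standard wave module can read the header directly. A minimal sketch (the function name and file path are placeholders):

```python
import wave

def wav_format(path: str) -> tuple:
    """Return (sample_rate_hz, channels, bits_per_sample) for a WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate(), wf.getnchannels(), wf.getsampwidth() * 8

# A file is ready for Vosk when this returns (16000, 1, 16).
```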


Development Tips

  • Pin vosk in requirements.txt if you’re distributing:
vosk==0.3.45   # example; use the latest that works for you
  • If you want live mic input later, look at sounddevice + Vosk streaming examples in the Vosk repo.

License

This script is yours; Vosk and the models have their own licenses—check each model’s page and the Vosk repo for details.

transcribe.py (full listing)

#!/usr/bin/env python3
import argparse
import json
import shutil
import subprocess
import sys
import wave
from pathlib import Path
from typing import List, Optional

import vosk


def ensure_ffmpeg() -> None:
    if shutil.which("ffmpeg") is None:
        sys.stderr.write("ERROR: ffmpeg not found in PATH. Install ffmpeg first.\n")
        sys.exit(1)


def convert_mp3_to_wav(mp3_path: Path, out_wav: Optional[Path] = None) -> Path:
    ensure_ffmpeg()
    if out_wav is None:
        out_wav = mp3_path.with_suffix(".wav")
    cmd = [
        "ffmpeg",
        "-y",            # overwrite if exists
        "-i", str(mp3_path),
        "-ar", "16000",  # resample to 16 kHz
        "-ac", "1",      # mono
        str(out_wav),
    ]
    try:
        subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        sys.stderr.write(f"ERROR: ffmpeg failed converting '{mp3_path.name}'.\n")
        sys.stderr.write(e.stderr.decode(errors="ignore"))
        sys.exit(1)
    return out_wav


def open_wave_for_read(path: Path) -> wave.Wave_read:
    try:
        return wave.open(str(path), "rb")
    except wave.Error as e:
        sys.stderr.write(f"ERROR: Unable to read WAV file '{path}': {e}\n")
        sys.exit(1)


def _extract_text_from_result(json_str: str) -> str:
    """
    Parse Vosk recognizer Result()/FinalResult() JSON and return the 'text' field (or empty string).
    """
    try:
        obj = json.loads(json_str)
        return obj.get("text", "").strip()
    except json.JSONDecodeError:
        # Fallback: return raw if JSON failed (shouldn't happen with Vosk)
        return json_str.strip()


def transcript_from_wav(wav_path: Path, model_path: Path) -> List[str]:
    """
    Transcribe a WAV file and return a list of finalized text segments.
    """
    if not model_path.exists():
        sys.stderr.write(f"ERROR: Vosk model not found at '{model_path}'.\n")
        sys.exit(1)
    segments: List[str] = []
    wf = open_wave_for_read(wav_path)
    with wf:
        n_channels = wf.getnchannels()
        sample_width = wf.getsampwidth()
        framerate = wf.getframerate()
        if n_channels != 1 or framerate != 16000:
            sys.stderr.write(
                f"WARNING: WAV is {framerate} Hz, {n_channels} channel(s). "
                "Vosk works best with 16 kHz mono PCM.\n"
            )
        if sample_width != 2:
            sys.stderr.write(
                f"WARNING: Sample width is {sample_width * 8} bits. "
                "16-bit PCM is recommended.\n"
            )
        model = vosk.Model(str(model_path))
        recognizer = vosk.KaldiRecognizer(model, framerate)
        # Stream frames; only collect finalized segments
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            if recognizer.AcceptWaveform(data):
                seg_text = _extract_text_from_result(recognizer.Result())
                if seg_text:
                    segments.append(seg_text)
        # Final tail
        final_text = _extract_text_from_result(recognizer.FinalResult())
        if final_text:
            segments.append(final_text)
    return segments


def write_transcript(output_txt: Path, segments: List[str]) -> None:
    output_txt.write_text(" ".join(segments).strip() + ("\n" if segments else ""), encoding="utf-8")


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Transcribe audio using Vosk. Converts MP3 to 16 kHz mono WAV via ffmpeg if needed, and writes a .txt transcript."
    )
    parser.add_argument(
        "input_path",
        type=Path,
        help="Path to input .mp3 or .wav file",
    )
    parser.add_argument(
        "--model",
        type=Path,
        default=Path("./vosk-model-en-us-0.42-gigaspeech"),
        help="Path to Vosk model directory (default: ./vosk-model-en-us-0.42-gigaspeech)",
    )
    parser.add_argument(
        "--out-wav",
        type=Path,
        default=None,
        help="Optional explicit output WAV path when converting from MP3",
    )
    args = parser.parse_args()

    in_path: Path = args.input_path
    model_path: Path = args.model
    if not in_path.exists():
        sys.stderr.write(f"ERROR: Input file not found: {in_path}\n")
        sys.exit(1)

    # Always base the .txt name on the ORIGINAL input file
    out_txt: Path = in_path.with_suffix(".txt")

    ext = in_path.suffix.lower()
    if ext == ".mp3":
        wav_path = convert_mp3_to_wav(in_path, args.out_wav)
        segments = transcript_from_wav(wav_path, model_path)
    elif ext == ".wav":
        segments = transcript_from_wav(in_path, model_path)
    else:
        sys.stderr.write("ERROR: Only .mp3 and .wav inputs are supported.\n")
        sys.exit(1)

    write_transcript(out_txt, segments)
    sys.stderr.write(f"Transcript written to: {out_txt}\n")


if __name__ == "__main__":
    main()