@danielrosehill
Created November 25, 2025 16:13
Optimal Audio Input Settings for OpenAI Whisper Speech-to-Text


This guide documents the recommended audio parameters for capturing voice input destined for OpenAI's Whisper API or local Whisper models. The goal is to minimize file size while preserving transcription accuracy.

TL;DR - Recommended Settings

| Parameter | Recommended | Notes |
| --- | --- | --- |
| Channels | Mono (1) | Whisper converts to mono internally anyway |
| Sample Rate | 16 kHz | Native processing rate; higher is downsampled |
| Bitrate | 32-64 kbps | Sweet spot for speech; 16 kbps still works |
| Bit Depth | 16-bit | Standard for speech |
| Format | MP3 or M4A (AAC) | Best compression for speech |

Detailed Parameter Guide

Mono vs Stereo

Recommendation: Always use Mono

Whisper internally converts all audio to mono using the equivalent of:

ffmpeg -i input.wav -ac 1 output.wav

Recording in stereo doubles the size of uncompressed files with zero benefit for transcription. If your microphone or recording software defaults to stereo, convert to mono before upload or configure your recording chain to capture mono directly.

Sample Rate

Recommendation: 16 kHz (or let Whisper downsample)

Whisper resamples all audio to 16 kHz before processing:

ffmpeg -i input.wav -ar 16000 output.wav

| Original Rate | File Size Impact | Transcription Quality |
| --- | --- | --- |
| 48 kHz | Largest | No benefit |
| 44.1 kHz | Large | No benefit |
| 22.05 kHz | Medium | Equivalent |
| 16 kHz | Smallest | Equivalent |

If you're recording specifically for Whisper, capturing at 16 kHz saves processing time and storage. However, if you want to archive high-quality originals, record at your mic's native rate and downsample copies for transcription.
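To put the sample-rate comparison in numbers: uncompressed PCM size is sample rate × bytes per sample × channels × duration. The sketch below assumes 16-bit mono by default and ignores the small WAV header; `wav_kb_per_min` is a hypothetical helper name, not part of any tool.

```shell
# Estimate uncompressed PCM size per minute, in KB (1 KB = 1000 bytes).
# Usage: wav_kb_per_min <sample_rate_hz> [bits_per_sample] [channels]
wav_kb_per_min() {
  local rate="$1" bits="${2:-16}" channels="${3:-1}"
  # bytes per second = rate * (bits / 8) * channels; times 60 s, then to KB
  echo $(( rate * bits / 8 * channels * 60 / 1000 ))
}

for rate in 48000 44100 22050 16000; do
  echo "$rate Hz: $(wav_kb_per_min "$rate") KB per minute"
done
```

Under these assumptions, a minute of 16 kHz mono audio is about 1.9 MB uncompressed, against about 5.8 MB at 48 kHz.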

Bitrate (Compressed Formats)

Recommendation: 32-64 kbps for speech

This is the most impactful parameter for file size optimization. Recording software and encoders often default to 128-320 kbps, which is massive overkill for speech transcription.

| Bitrate | File Size (1 min) | Transcription Quality | Notes |
| --- | --- | --- | --- |
| 128 kbps | ~960 KB | Excellent | Overkill for speech |
| 64 kbps | ~480 KB | Excellent | Safe conservative choice |
| 32 kbps | ~240 KB | Excellent | Recommended minimum |
| 16 kbps | ~120 KB | Good | Works, but at the edge |

Key finding: Transcription accuracy remains consistent down to 32 kbps at 12 kHz. Below this, you may encounter edge cases with unclear speech, but for clear voice recordings, even 16 kbps works.
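The per-minute sizes in the bitrate table follow from bitrate × duration ÷ 8. A minimal sketch, assuming constant-bitrate encoding and ignoring container overhead (`cbr_kb` is a hypothetical helper name):

```shell
# Estimate the size of constant-bitrate audio in KB (1 KB = 1000 bytes).
# Usage: cbr_kb <bitrate_kbps> <duration_seconds>
cbr_kb() {
  local kbps="$1" seconds="$2"
  # kilobits per second * seconds = kilobits; divide by 8 for kilobytes
  echo $(( kbps * seconds / 8 ))
}

for kbps in 128 64 32 16; do
  echo "$kbps kbps: ~$(cbr_kb "$kbps" 60) KB per minute"
done
```

Real files will differ slightly because of headers and, for variable-bitrate encoders, because the bitrate is only a target.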

Bit Depth

Recommendation: 16-bit

The open-source Whisper loader decodes audio to 16-bit PCM (before converting it to floating point), the equivalent of:

ffmpeg -i input.wav -c:a pcm_s16le output.wav

24-bit or 32-bit recordings provide no transcription benefit and increase file size.


File Format Recommendations

For API Upload (25 MB limit)

| Format | Pros | Cons |
| --- | --- | --- |
| MP3 | Universal support, good compression | Slightly larger than modern codecs |
| M4A (AAC) | Better compression than MP3 | Slightly less universal |
| Opus | Best compression, low latency | Less tool support |
| WAV | No compression artifacts | Very large files |

Supported formats: m4a, mp3, webm, mp4, mpga, wav, mpeg

Handling the 25 MB Limit

At recommended settings (mono, 16 kHz, 32 kbps MP3):

  • 25 MB ≈ 100+ minutes of audio

At these settings, you're unlikely to hit the limit for typical voice notes or dictation sessions.
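The "100+ minutes" estimate can be reproduced, and a file checked before upload, with a short sketch. It assumes the 25 MB limit means 25,000,000 bytes (the conservative reading) and constant-bitrate audio; `max_minutes` and `fits_limit` are hypothetical helper names.

```shell
# Assumption: the 25 MB limit is treated as 25,000,000 bytes (decimal MB).
LIMIT_BYTES=25000000

# Whole minutes of constant-bitrate audio that fit under the limit.
# Usage: max_minutes <bitrate_kbps>
max_minutes() {
  local kbps="$1"
  echo $(( LIMIT_BYTES * 8 / (kbps * 1000) / 60 ))
}

# Succeeds (exit status 0) if the file is within the limit.
# Usage: fits_limit <file>
fits_limit() {
  local size
  size=$(( $(wc -c < "$1") ))   # arithmetic expansion strips any padding
  [ "$size" -le "$LIMIT_BYTES" ]
}

echo "32 kbps: about $(max_minutes 32) minutes"
echo "64 kbps: about $(max_minutes 64) minutes"
```

A typical use would be `fits_limit output.mp3 || echo "too large, split or re-encode"` just before uploading.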


FFmpeg Commands

Convert existing audio to Whisper-optimized format

# Basic conversion: mono, 16kHz, 32kbps MP3
ffmpeg -i input.wav -ac 1 -ar 16000 -b:a 32k output.mp3

# M4A/AAC variant (similar size at the same bitrate; AAC usually holds up better at low bitrates)
ffmpeg -i input.wav -ac 1 -ar 16000 -b:a 32k -c:a aac output.m4a

# Conservative quality (64 kbps, 22.05 kHz)
ffmpeg -i input.wav -ac 1 -ar 22050 -b:a 64k output.mp3

Configure recording software

If using arecord or similar:

# Record directly at Whisper-optimal settings
arecord -f S16_LE -r 16000 -c 1 output.wav

Batch conversion

# Convert all WAV files in directory
for f in *.wav; do
  ffmpeg -i "$f" -ac 1 -ar 16000 -b:a 32k "${f%.wav}.mp3"
done
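A slightly more general sketch of the same loop: it accepts several input extensions and skips any file whose output already exists, so a rerun does not redo finished work. The extension list and the helper names `whisper_out_name` and `convert_all` are assumptions for illustration.

```shell
# Derive the .mp3 output name for any input extension (hypothetical helper).
whisper_out_name() {
  printf '%s\n' "${1%.*}.mp3"
}

# Convert every matching file in the current directory.
# The extension list is illustrative; adjust it to your recordings.
convert_all() {
  local f out
  for f in *.wav *.flac *.m4a; do
    [ -e "$f" ] || continue          # an unmatched glob stays literal; skip it
    out=$(whisper_out_name "$f")
    if [ -e "$out" ]; then
      echo "skip: $out already exists"
      continue
    fi
    # -nostdin disables ffmpeg's interactive prompts on standard input
    ffmpeg -nostdin -i "$f" -ac 1 -ar 16000 -b:a 32k "$out"
  done
}
```

Run `convert_all` from the directory that holds the recordings. Two inputs that share a base name (for example `a.wav` and `a.flac`) map to the same output, so only the first is converted.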

Real-World Benchmarks

From testing by the community:

| Configuration | File Size (short clip) | API Response Time |
| --- | --- | --- |
| 16 kbps, 12 kHz | 5 KB | 1.8 seconds |
| 32 kbps, 12 kHz | 9 KB | 1.8 seconds |
| 128 kbps, 24 kHz | 33 KB | 2.6 seconds |

Key insight: In these measurements, shrinking the file by roughly 70% (33 KB to 9 KB) cut response time by about 30% (2.6 s to 1.8 s), with no measurable impact on transcription accuracy.
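The relative change between any two rows of the benchmark table is (before - after) ÷ before. A minimal sketch, using integer units because shell arithmetic has no decimals (`pct_reduction` is a hypothetical helper name):

```shell
# Percentage reduction from <before> to <after>, rounded down.
# Pass integer units (milliseconds, KB), e.g. 2.6 s becomes 2600.
pct_reduction() {
  local before="$1" after="$2"
  echo $(( (before - after) * 100 / before ))
}

echo "response time: $(pct_reduction 2600 1800)% lower"   # 2.6 s vs 1.8 s rows
echo "file size:     $(pct_reduction 33 9)% smaller"      # 33 KB vs 9 KB rows
```

With only one short clip per configuration, these figures are indicative rather than a benchmark; network conditions alone can move response times by similar amounts.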


Phone/Mobile Recording Tips

Many phone voice memo apps record at higher quality than transcription needs. When configuring:

  1. Set to mono if your app allows
  2. Choose "voice" or "speech" preset over "music" presets
  3. Select lower quality/smaller file options
  4. If no settings available, convert before upload using the FFmpeg commands above

Summary

For speech-to-text with Whisper:

  1. Don't overthink it: Whisper is remarkably tolerant of compressed audio
  2. Use mono: Whisper downmixes anyway, so stereo is wasted bandwidth
  3. 16 kHz sample rate: Higher rates are downsampled anyway
  4. 32-64 kbps: The sweet spot for speech; much lower than studio defaults
  5. MP3 or M4A: Both work well; choose based on your toolchain

The rule of thumb: if you can understand it, Whisper probably can too.


This gist was generated by Claude Code. Please validate the information against current OpenAI documentation, as API specifications may change over time.
