This guide documents the recommended audio parameters for capturing voice input destined for OpenAI's Whisper API or local Whisper models. The goal is to minimize file size while preserving transcription accuracy.
| Parameter | Recommended | Notes |
|---|---|---|
| Channels | Mono (1) | Whisper converts to mono internally anyway |
| Sample Rate | 16 kHz | Native processing rate; higher is downsampled |
| Bitrate | 32-64 kbps | Sweet spot for speech; 16 kbps still works |
| Bit Depth | 16-bit | Standard for speech |
| Format | MP3 or M4A (AAC) | Best compression for speech |
Recommendation: Always use Mono
Whisper internally converts all audio to mono using the equivalent of:

    ffmpeg -i input.wav -ac 1 output.wav

Recording in stereo doubles file size with zero benefit for transcription. If your microphone or recording software defaults to stereo, convert to mono before upload or configure your recording chain to capture mono directly.
Recommendation: 16 kHz (or let Whisper downsample)
Whisper resamples all audio to 16 kHz before processing:

    ffmpeg -i input.wav -ar 16000 output.wav

| Original Rate | File Size Impact | Transcription Quality |
|---|---|---|
| 48 kHz | Largest | No benefit |
| 44.1 kHz | Large | No benefit |
| 22.05 kHz | Medium | Equivalent |
| 16 kHz | Smallest | Equivalent |
If you're recording specifically for Whisper, capturing at 16 kHz saves processing time and storage. However, if you want to archive high-quality originals, record at your mic's native rate and downsample copies for transcription.
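One way to apply the archive-then-downsample approach is to check a file's sample rate first and only resample when it is above 16 kHz. The sketch below assumes `ffprobe` (shipped with FFmpeg) is on the PATH. The function names `get_sample_rate`, `needs_downsample`, and `make_transcription_copy` are hypothetical helpers made up for illustration, not part of any tool.

```shell
# Hypothetical helpers; names are illustrative only.

# Print the sample rate (Hz) of the first audio stream in a file.
get_sample_rate() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=sample_rate \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Succeed (exit 0) when the given rate is above the 16 kHz target.
needs_downsample() {
  [ "$1" -gt 16000 ]
}

# Write a 16 kHz copy when the source rate is higher; otherwise copy as-is.
# The original file is never modified.
make_transcription_copy() {
  local src="$1" dst="$2" rate
  rate="$(get_sample_rate "$src")" || return 1
  if needs_downsample "$rate"; then
    ffmpeg -i "$src" -ar 16000 "$dst"
  else
    cp "$src" "$dst"
  fi
}

# Example (assumes archive.wav exists):
#   make_transcription_copy archive.wav for_whisper.wav
```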
Recommendation: 32-64 kbps for speech
This is the most impactful parameter for file size optimization. Recording software and studio setups often default to 128-320 kbps, which is far more than speech transcription needs.
| Bitrate | File Size (1 min) | Transcription Quality | Notes |
|---|---|---|---|
| 128 kbps | ~960 KB | Excellent | Overkill for speech |
| 64 kbps | ~480 KB | Excellent | Safe conservative choice |
| 32 kbps | ~240 KB | Excellent | Recommended minimum |
| 16 kbps | ~120 KB | Good | Works, but at the edge |
Key finding: Transcription accuracy remains consistent down to 32 kbps at 12 kHz. Below this, you may encounter edge cases with unclear speech, but for clear voice recordings, even 16 kbps works.
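The per-minute sizes in the table follow directly from the bitrate: kilobits per second times duration, divided by 8. The sketch below reproduces that arithmetic. It assumes constant-bitrate encoding, uses 1 KB = 1000 bytes, and ignores container overhead, so real files will differ slightly. `estimate_size_kb` is a hypothetical helper name.

```shell
# Estimate encoded size in KB (1 KB = 1000 bytes) for a constant bitrate.
# Usage: estimate_size_kb <bitrate_kbps> <duration_seconds>
estimate_size_kb() {
  local kbps="$1" seconds="$2"
  # kilobits/s * seconds = kilobits; divide by 8 to get kilobytes.
  echo $(( kbps * seconds / 8 ))
}

# Print the estimate for each bitrate listed in the table.
for kbps in 128 64 32 16; do
  printf '%3d kbps -> ~%d KB per minute\n' "$kbps" "$(estimate_size_kb "$kbps" 60)"
done
```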
Recommendation: 16-bit
Whisper processes audio as 16-bit PCM internally:

    ffmpeg -i input.wav -c:a pcm_s16le output.wav

24-bit or 32-bit recordings provide no transcription benefit and increase file size.
| Format | Pros | Cons |
|---|---|---|
| MP3 | Universal support, good compression | Slightly larger than modern codecs |
| M4A (AAC) | Better compression than MP3 | Slightly less universal |
| Opus | Best compression, low latency | Less tool support |
| WAV | No compression artifacts | Very large files |
Supported formats: m4a, mp3, webm, mp4, mpga, wav, mpeg
The API limits uploads to 25 MB per file. At recommended settings (mono, 16 kHz, 32 kbps MP3):
- 25 MB ≈ 100+ minutes of audio
At these settings, you're unlikely to hit the limit for typical voice notes or dictation sessions.
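For longer recordings it can help to check the numbers before uploading. The sketch below estimates how many minutes fit under the 25 MB figure at a given bitrate and checks whether a file is within it. It treats 25 MB as 25,000,000 bytes, which is an assumption; confirm the exact limit against current API documentation. `max_minutes` and `fits_upload_limit` are hypothetical helper names.

```shell
# 25 MB figure from the text above, treated as 25,000,000 bytes (assumption).
LIMIT_BYTES=$(( 25 * 1000 * 1000 ))

# Approximate whole minutes of audio that fit under the limit
# at a constant bitrate. Usage: max_minutes <bitrate_kbps>
max_minutes() {
  local kbps="$1"
  echo $(( LIMIT_BYTES * 8 / (kbps * 1000 * 60) ))
}

# Succeed (exit 0) when the file exists and is within the limit.
fits_upload_limit() {
  [ -f "$1" ] || return 1
  local size
  size=$(( $(wc -c < "$1") ))   # arithmetic expansion strips any padding
  [ "$size" -le "$LIMIT_BYTES" ]
}

# Example (assumes output.mp3 exists):
#   fits_upload_limit output.mp3 && echo "OK to upload"
```

At 32 kbps, `max_minutes 32` gives 104, in line with the "100+ minutes" estimate.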

    # Basic conversion: mono, 16kHz, 32kbps MP3
    ffmpeg -i input.wav -ac 1 -ar 16000 -b:a 32k output.mp3

    # M4A/AAC variant (slightly smaller files)
    ffmpeg -i input.wav -ac 1 -ar 16000 -b:a 32k -c:a aac output.m4a

    # Conservative quality (64kbps, 22kHz)
    ffmpeg -i input.wav -ac 1 -ar 22050 -b:a 64k output.mp3

If using arecord or similar:

    # Record directly at Whisper-optimal settings
    arecord -f S16_LE -r 16000 -c 1 output.wav

    # Convert all WAV files in directory
    for f in *.wav; do
      ffmpeg -i "$f" -ac 1 -ar 16000 -b:a 32k "${f%.wav}.mp3"
    done

From testing by the community:
| Configuration | File Size (short clip) | API Response Time |
|---|---|---|
| 16 kbps, 12 kHz | 5 KB | 1.8 seconds |
| 32 kbps, 12 kHz | 9 KB | 1.8 seconds |
| 128 kbps, 24 kHz | 33 KB | 2.6 seconds |
Key insight: In this test, optimizing the audio format cut API response time by about 30% (2.6 s to 1.8 s) and file size by over 70%, with no measurable impact on transcription accuracy.
Most phone voice memo apps record at unnecessarily high quality. When configuring:
- Set to mono if your app allows
- Choose "voice" or "speech" preset over "music" presets
- Select lower quality/smaller file options
- If no settings available, convert before upload using the FFmpeg commands above
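For phone recordings that cannot be configured, the conversion can be applied to a whole folder of exported files. The sketch below assumes the files are `.m4a` or `.wav` in the current directory and that `ffmpeg` is installed. `output_name` and `convert_all` are hypothetical helper names, and the `-n` flag tells ffmpeg not to overwrite an existing output.

```shell
# Derive the output name by swapping the extension for .mp3.
output_name() {
  printf '%s\n' "${1%.*}.mp3"
}

# Convert every .m4a and .wav file in the current directory
# to mono, 16 kHz, 32 kbps MP3. Existing outputs are left alone (-n).
convert_all() {
  local f
  for f in *.m4a *.wav; do
    [ -e "$f" ] || continue   # skip patterns that matched nothing
    ffmpeg -n -i "$f" -ac 1 -ar 16000 -b:a 32k "$(output_name "$f")"
  done
}

# Example: run from the folder holding the exported recordings.
#   convert_all
```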
For speech-to-text with Whisper:
- Don't overthink it: Whisper is remarkably tolerant of compressed audio
- Record in mono: Stereo is wasted bandwidth, since Whisper converts to mono anyway
- 16 kHz sample rate: Higher rates are downsampled anyway
- 32-64 kbps: The sweet spot for speech; much lower than studio defaults
- MP3 or M4A: Both work well; choose based on your toolchain
The rule of thumb: if you can understand it, Whisper probably can too.
- Optimise OpenAI Whisper API: Audio Format, Sampling Rate and Quality
- OpenAI Whisper GitHub Discussion: Optimal Sample Rate
- OpenAI Community: Minimum Bitrate for Whisper
- Compression Strategy for Whisper Files
- OpenAI Whisper API Limits
This gist was generated by Claude Code. Please validate the information against current OpenAI documentation, as API specifications may change over time.