danielrosehill/README.md

Optimal Audio Input Settings for OpenAI Whisper Speech-to-Text

This guide documents the recommended audio parameters for capturing voice input destined for OpenAI's Whisper API or local Whisper models. The goal is to minimize file size while preserving transcription accuracy.
TL;DR - Recommended Settings


| Parameter | Recommended | Notes |
| --- | --- | --- |
| Channels | Mono (1) | Whisper converts to mono internally anyway |
| Sample Rate | 16 kHz | Native processing rate; higher is downsampled |
| Bitrate | 32-64 kbps | Sweet spot for speech; 16 kbps still works |
| Bit Depth | 16-bit | Standard for speech |
| Format | MP3 or M4A (AAC) | Best compression for speech |

Detailed Parameter Guide

Mono vs Stereo

Recommendation: Always use Mono
Whisper internally converts all audio to mono using the equivalent of:
ffmpeg -i input.wav -ac 1 output.wav
Recording in stereo doubles file size with zero benefit for transcription. If your microphone or recording software defaults to stereo, convert to mono before upload or configure your recording chain to capture mono directly.
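
The "convert to mono before upload" step can be sketched as a small helper that downmixes only when the input actually has more than one channel. This is a minimal sketch, assuming `ffprobe` and `ffmpeg` are on the PATH; the function names `channel_count` and `ensure_mono` are illustrative and not part of any tool.

```shell
# Print the channel count of the first audio stream in a file.
channel_count() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=channels \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Downmix to mono only when the input has more than one channel.
# Usage: ensure_mono INPUT OUTPUT
ensure_mono() {
  channels=$(channel_count "$1")
  # Bail out if ffprobe did not return a plain number.
  case "$channels" in
    ''|*[!0-9]*) echo "could not read channel count: $1" >&2; return 1 ;;
  esac
  if [ "$channels" -gt 1 ]; then
    ffmpeg -nostdin -loglevel error -i "$1" -ac 1 "$2"
  else
    echo "already mono: $1"
  fi
}
```

Calling `ensure_mono input.wav mono.wav` writes `mono.wav` only for multi-channel input; mono files are reported and left alone.
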
Sample Rate

Recommendation: 16 kHz (or let Whisper downsample)
Whisper resamples all audio to 16 kHz before processing:
ffmpeg -i input.wav -ar 16000 output.wav


| Original Rate | File Size Impact | Transcription Quality |
| --- | --- | --- |
| 48 kHz | Largest | No benefit |
| 44.1 kHz | Large | No benefit |
| 22.05 kHz | Medium | Equivalent |
| 16 kHz | Smallest | Equivalent |

If you're recording specifically for Whisper, capturing at 16 kHz saves processing time and storage. However, if you want to archive high-quality originals, record at your mic's native rate and downsample copies for transcription.
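
To put rough numbers on the "File Size Impact" column, the size of uncompressed PCM audio is sample rate × channels × bytes per sample × duration. The sketch below assumes 16-bit mono WAV and ignores the small WAV header; `pcm_bytes` is an illustrative helper name.

```shell
# Bytes of uncompressed PCM audio (header not included).
# Usage: pcm_bytes SAMPLE_RATE_HZ CHANNELS BIT_DEPTH SECONDS
pcm_bytes() {
  echo $(( $1 * $2 * ($3 / 8) * $4 ))
}

# One minute of 16-bit mono at each rate from the table above.
for rate in 48000 44100 22050 16000; do
  echo "$rate Hz: $(pcm_bytes "$rate" 1 16 60) bytes/min"
done
```

One minute of 16-bit mono comes to 5,760,000 bytes at 48 kHz and 1,920,000 bytes at 16 kHz, so recording at 16 kHz cuts uncompressed size to a third.
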
Bitrate (Compressed Formats)

Recommendation: 32-64 kbps for speech
This is the most impactful parameter for file size optimization. Recording software and encoders often default to 128-320 kbps, which is far more than speech transcription needs.


| Bitrate | File Size (1 min) | Transcription Quality | Notes |
| --- | --- | --- | --- |
| 128 kbps | ~960 KB | Excellent | Overkill for speech |
| 64 kbps | ~480 KB | Excellent | Safe conservative choice |
| 32 kbps | ~240 KB | Excellent | Recommended minimum |
| 16 kbps | ~120 KB | Good | Works, but at the edge |

Key finding: Transcription accuracy remains consistent down to 32 kbps at 12 kHz. Below this, you may encounter edge cases with unclear speech, but for clear voice recordings, even 16 kbps works.
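
The per-minute sizes in the table follow directly from the bitrate: kilobits per second × seconds ÷ 8 gives kilobytes. A minimal sketch, assuming constant-bitrate encoding, 1 KB = 1000 bytes, and no container overhead; `cbr_kb` is an illustrative helper name.

```shell
# Approximate size in KB of constant-bitrate audio.
# Usage: cbr_kb BITRATE_KBPS SECONDS
cbr_kb() {
  echo $(( $1 * $2 / 8 ))
}

# One minute at each bitrate from the table above.
for kbps in 128 64 32 16; do
  echo "$kbps kbps: ~$(cbr_kb "$kbps" 60) KB/min"
done
```

Real files will differ slightly because of metadata and container framing, but the estimate is close enough for planning uploads.
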
Bit Depth

Recommendation: 16-bit
Whisper decodes all input to 16-bit PCM before processing, using the equivalent of:
ffmpeg -i input.wav -c:a pcm_s16le output.wav
24-bit or 32-bit recordings provide no transcription benefit and increase file size.

File Format Recommendations

For API Upload (25 MB limit)


| Format | Pros | Cons |
| --- | --- | --- |
| MP3 | Universal support, good compression | Slightly larger than modern codecs |
| M4A (AAC) | Better compression than MP3 | Slightly less universal |
| Opus | Best compression, low latency | Less tool support |
| WAV | No compression artifacts | Very large files |

Supported formats: m4a, mp3, webm, mp4, mpga, wav, mpeg
Handling the 25 MB Limit

At recommended settings (mono, 16 kHz, 32 kbps MP3):

25 MB ≈ 100+ minutes of audio

At these settings, you're unlikely to hit the limit for typical voice notes or dictation sessions.
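
The duration estimate can be reproduced, and a file checked against the limit before upload, with a few lines of shell. This sketch assumes the limit is counted as 25 × 1024 × 1024 bytes (the API documentation only says "25 MB") and constant-bitrate audio; `max_minutes` and `fits_limit` are illustrative names.

```shell
# Assumption: 25 MB means 25 * 1024 * 1024 bytes.
LIMIT_BYTES=$(( 25 * 1024 * 1024 ))

# Whole minutes of constant-bitrate audio that fit under the limit.
# Usage: max_minutes BITRATE_KBPS
max_minutes() {
  echo $(( LIMIT_BYTES * 8 / ($1 * 1000) / 60 ))
}

# Succeeds (exit status 0) when the file is within the limit.
# Usage: fits_limit FILE
fits_limit() {
  size=$(( $(wc -c < "$1") ))   # arithmetic expansion strips padding
  [ "$size" -le "$LIMIT_BYTES" ]
}
```

Under that assumption `max_minutes 32` prints 109, consistent with the "100+ minutes" figure. A pre-upload guard could look like `fits_limit output.mp3 || echo "too large, split or re-encode"`.
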

FFmpeg Commands

Convert existing audio to Whisper-optimized format

# Basic conversion: mono, 16kHz, 32kbps MP3
ffmpeg -i input.wav -ac 1 -ar 16000 -b:a 32k output.mp3

# M4A/AAC variant (slightly smaller files)
ffmpeg -i input.wav -ac 1 -ar 16000 -b:a 32k -c:a aac output.m4a

# Conservative quality (64kbps, 22kHz)
ffmpeg -i input.wav -ac 1 -ar 22050 -b:a 64k output.mp3
Configure recording software

If using arecord or similar:
# Record directly at Whisper-optimal settings
arecord -f S16_LE -r 16000 -c 1 output.wav
Batch conversion

# Convert all WAV files in directory
for f in *.wav; do
  ffmpeg -i "$f" -ac 1 -ar 16000 -b:a 32k "${f%.wav}.mp3"
done
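
Re-running the loop above makes FFmpeg stop and ask before overwriting each existing MP3. A variant that skips already-converted files is sketched below; the function name `convert_new_wavs` is illustrative, and the encoding settings are the same ones used in the basic conversion.

```shell
# Convert every WAV in the current directory, skipping files that
# already have an MP3 next to them.
convert_new_wavs() {
  for f in *.wav; do
    [ -e "$f" ] || continue   # no WAV files: the glob stays literal
    out="${f%.wav}.mp3"
    if [ -e "$out" ]; then
      echo "skip: $out exists"
      continue
    fi
    ffmpeg -nostdin -loglevel error -i "$f" -ac 1 -ar 16000 -b:a 32k "$out"
  done
}
```

Run `convert_new_wavs` from the directory holding the recordings; it can be repeated safely as new files arrive.
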

Real-World Benchmarks

From testing by the community:


| Configuration | File Size (short clip) | API Response Time |
| --- | --- | --- |
| 16 kbps, 12 kHz | 5 KB | 1.8 seconds |
| 32 kbps, 12 kHz | 9 KB | 1.8 seconds |
| 128 kbps, 24 kHz | 33 KB | 2.6 seconds |

Key insight: In these tests, optimizing the audio format cut API response time from 2.6 to 1.8 seconds (about 31%), with no measurable impact on transcription accuracy.
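
To repeat this kind of measurement on your own recordings, one option is to time the upload with curl's built-in timer. This is a rough sketch, assuming `curl` and `awk` are installed, `OPENAI_API_KEY` is set, and the `whisper-1` model on the standard transcription endpoint; `time_transcription` and `pct_reduction` are illustrative names, and single runs will vary with network conditions.

```shell
# Print the total request time in seconds for one transcription call.
# Usage: time_transcription FILE
time_transcription() {
  curl -sS -o /dev/null -w '%{time_total}\n' \
    https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F model=whisper-1 \
    -F file=@"$1"
}

# Percentage reduction from a baseline time to a new time.
# Usage: pct_reduction OLD_SECONDS NEW_SECONDS
pct_reduction() {
  awk -v old="$1" -v new="$2" \
    'BEGIN { printf "%.0f\n", (old - new) / old * 100 }'
}
```

For example, `pct_reduction 2.6 1.8` compares the slowest and fastest rows of the table. Averaging several runs per file gives a fairer comparison than a single request.
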

Phone/Mobile Recording Tips

Most phone voice memo apps record at unnecessarily high quality. When configuring:

- Set to mono if your app allows
- Choose "voice" or "speech" preset over "music" presets
- Select lower quality/smaller file options
- If no settings available, convert before upload using the FFmpeg commands above


Summary

For speech-to-text with Whisper:

- Don't overthink it: Whisper is remarkably tolerant of compressed audio
- Use mono: stereo doubles file size with no transcription benefit
- 16 kHz sample rate: Higher rates are downsampled anyway
- 32-64 kbps: The sweet spot for speech; much lower than typical encoder defaults
- MP3 or M4A: Both work well; choose based on your toolchain

The rule of thumb: if you can understand it, Whisper probably can too.

References


Optimise OpenAI Whisper API: Audio Format, Sampling Rate and Quality
OpenAI Whisper GitHub Discussion: Optimal Sample Rate
OpenAI Community: Minimum Bitrate for Whisper
Compression Strategy for Whisper Files
OpenAI Whisper API Limits


This gist was generated by Claude Code. Please validate the information against current OpenAI documentation, as API specifications may change over time.