Your Mac's microphone captures audio at 48 kHz. Whisper-style models expect 16 kHz. The conversion matters more than you'd think.
TL;DR
- Bad resampling introduces aliasing artifacts the model can “hear.”
- Band-limited resampling keeps consonants intact while preventing fold-back noise.
- Resample after upstream steps (VAD/noise suppression), not before.
The naive approach breaks things
Simple decimation (keeping every third sample and dropping the rest) creates aliasing artifacts. Any energy above 8 kHz in the 48 kHz signal folds back below 8 kHz, where it lands on top of real speech. The model hears phantom sounds that weren't in your voice.
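To make the fold-back concrete, here is a minimal sketch in plain Python. The function names are made up for illustration, and the 10 kHz test tone is an arbitrary choice above the 8 kHz limit. It drops two of every three samples and compares the result with a 6 kHz tone generated directly at 16 kHz.

```python
import math

SRC_RATE = 48_000  # capture rate
DST_RATE = 16_000  # rate the model expects
FACTOR = SRC_RATE // DST_RATE  # 3

def tone(freq_hz, rate_hz, n_samples):
    """Unit-amplitude cosine at freq_hz, sampled at rate_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]

def naive_decimate(samples, factor):
    """Keep every factor-th sample with no filtering (the approach to avoid)."""
    return samples[::factor]

# A 10 kHz tone sits above the 8 kHz Nyquist limit of the 16 kHz target.
decimated = naive_decimate(tone(10_000, SRC_RATE, 4800), FACTOR)

# Compare against a 6 kHz tone (16 kHz - 10 kHz) generated at 16 kHz.
alias = tone(6_000, DST_RATE, len(decimated))
max_diff = max(abs(a - b) for a, b in zip(decimated, alias))
```

Here `max_diff` is zero up to floating-point rounding: after naive decimation the 10 kHz tone is indistinguishable from a 6 kHz one, which is squarely inside the speech band.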
Cheap resampling libraries optimize for speed, not quality. That is fine for ringtones, but bad for speech recognition, where subtle differences between consonants matter.
Band-limited resampling
We use band-limited resampling: apply a low-pass filter with its cutoff at or just below the new Nyquist frequency (8 kHz, half of the 16 kHz target rate), then decimate by 3. This removes the frequencies that would alias before they can cause problems.
The filter matters. Too aggressive and you lose the high-frequency content that distinguishes "s" from "f" from "th". Too gentle and aliasing sneaks through.
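As a rough illustration of the filter-then-decimate idea, here is a sketch in plain Python using a Hamming-windowed sinc filter. The 121-tap length and 7.2 kHz cutoff are assumptions picked for the example, not the values used in the pipeline described here, and the function names are hypothetical.

```python
import math

def lowpass_taps(num_taps, cutoff_hz, rate_hz):
    """Hamming-windowed sinc low-pass FIR, normalized to unity gain at DC."""
    fc = cutoff_hz / rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]

def resample_48k_to_16k(samples, num_taps=121, cutoff_hz=7_200):
    """Low-pass at 48 kHz, then keep every third sample.

    Only the outputs that survive decimation are computed. Samples before
    the start and after the end of the input are treated as silence.
    num_taps is assumed to be odd so the filter has a center tap.
    """
    taps = lowpass_taps(num_taps, cutoff_hz, 48_000)
    half = num_taps // 2
    out = []
    for center in range(0, len(samples), 3):
        acc = 0.0
        for k, tap in enumerate(taps):
            i = center + k - half
            if 0 <= i < len(samples):
                acc += tap * samples[i]
        out.append(acc)
    return out
```

The two parameters are the knobs from the paragraph above. Lowering `cutoff_hz` or shortening the filter pushes the roll-off further into the sibilant range or lets more energy above 8 kHz leak through. A production resampler would typically use a polyphase implementation from an established DSP library rather than a Python loop.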
Why not just record at 16 kHz?
macOS audio APIs default to 48 kHz. Fighting the system adds latency and edge cases. Better to accept 48 kHz and resample correctly.
Plus, we process at 48 kHz for the earlier pipeline stages (VAD, noise suppression). Higher sample rate means more information to work with when detecting speech boundaries.
The full chain
- Capture at 48 kHz (Mac default)
- RNNoise VAD + noise suppression (48 kHz)
- Segmentation into phrases (48 kHz)
- LUFS normalization + high-pass filter (48 kHz)
- Band-limited resample to 16 kHz
- Feed to Whisper-style model
Each step operates at the sample rate that makes sense for it. The model gets clean 16 kHz audio that matches its training distribution.
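One way to keep that ordering honest is to have each stage declare the sample rate it expects and produces, then check the chain before running it. The sketch below is hypothetical: the `Stage` type and `check_rates` helper are invented for illustration, and only the stage names and rates come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    rate_in: int   # sample rate the stage expects, in Hz
    rate_out: int  # sample rate the stage hands to the next stage, in Hz

# Stage names mirror the chain above; the Stage type itself is hypothetical.
PIPELINE = [
    Stage("capture", 48_000, 48_000),
    Stage("rnnoise_vad_and_denoise", 48_000, 48_000),
    Stage("segmentation", 48_000, 48_000),
    Stage("lufs_normalize_and_highpass", 48_000, 48_000),
    Stage("band_limited_resample", 48_000, 16_000),
    # The model emits text; rate_out is kept only so the type stays uniform.
    Stage("whisper_style_model", 16_000, 16_000),
]

def check_rates(stages):
    """Raise ValueError if any stage receives a rate it does not expect.

    Returns the sample rate fed to the final stage.
    """
    for prev, nxt in zip(stages, stages[1:]):
        if prev.rate_out != nxt.rate_in:
            raise ValueError(
                f"{prev.name} outputs {prev.rate_out} Hz "
                f"but {nxt.name} expects {nxt.rate_in} Hz"
            )
    return stages[-1].rate_in
```

Calling `check_rates(PIPELINE)` returns 16000. Moving the resampler ahead of the RNNoise stage raises a `ValueError`, which catches the "resample before the upstream steps" mistake at startup instead of in the transcripts.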
Small details compound. A 1% improvement in each pipeline stage adds up to noticeably better transcripts.
