Why we use RNNoise for voice activity detection

Trimming silence and suppressing noise before recognition, not after.

28 Nov 2025

The recognizer doesn't need to hear your keyboard. Or the air conditioner. Or the three seconds of silence while you think.

We use RNNoise for voice activity detection. It's a small recurrent neural network trained on noisy speech that does two things well: it estimates, frame by frame, how likely it is that you're actually talking, and it suppresses background noise without destroying consonants.
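
For a sense of how little machinery this involves, here is a minimal sketch against RNNoise's public C API. The frame size and the three calls are the library's own; the surrounding wiring is illustrative, not our capture code.

    /* Minimal sketch of the RNNoise C API (Xiph.Org's rnnoise library). */
    #include <rnnoise.h>

    #define FRAME_SIZE 480  /* RNNoise works on 10 ms frames of 48 kHz audio */

    int main(void) {
        DenoiseState *st = rnnoise_create(NULL);  /* NULL selects the built-in model */
        float frame[FRAME_SIZE] = {0};            /* samples as floats in 16-bit range */

        /* Denoises the frame in place and returns the probability (0..1)
         * that it contains speech: both halves of the job in one call. */
        float speech_prob = rnnoise_process_frame(st, frame, frame);
        (void)speech_prob;

        rnnoise_destroy(st);
        return 0;
    }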

Why this matters for dictation

Most accuracy problems aren't model problems. They're input problems. Feed the recognizer clean speech segments, and it does better work. Feed it room noise and silence, and it hallucinates words or drops them.

RNNoise runs continuously while you hold the hotkey. It tracks speech vs non-speech in real time and gently pushes down ambient noise. The result: cleaner segments reach the Whisper-style model.
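
In practice, that per-frame speech probability has to become a yes/no decision, and a short hangover keeps brief pauses from chopping words apart. A minimal sketch, with assumed values for the threshold and hangover length:

    /* Turning the per-frame speech probability into a speech/non-speech decision.
     * The threshold and hangover length are assumptions, not our tuned values. */
    #define VAD_THRESHOLD   0.6f  /* assumed: above this counts as speech */
    #define HANGOVER_FRAMES 30    /* assumed: stay "in speech" ~300 ms after the last hit */

    typedef struct { int hangover; } VadState;

    int is_speech(VadState *s, float speech_prob) {
        if (speech_prob > VAD_THRESHOLD)
            s->hangover = HANGOVER_FRAMES;  /* refresh on every speechy frame */
        else if (s->hangover > 0)
            s->hangover--;                  /* coast through brief pauses */
        return s->hangover > 0;
    }

Call it once per 10 ms frame with the probability returned by rnnoise_process_frame.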

What we tried before

Simple energy-based VAD (a volume threshold) fails whenever the noise floor moves around. Loud typing triggers it. Quiet speech doesn't. We needed something that understood speech patterns, not just amplitude.
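
For comparison, an energy gate is nothing more than a per-frame loudness check, along the lines of the sketch below (threshold arbitrary). A keystroke transient clears it easily; quiet speech often doesn't.

    /* The shape of a volume-threshold VAD: it measures loudness, not speechiness. */
    #include <math.h>

    int energy_vad(const float *frame, int n, float rms_threshold) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += frame[i] * frame[i];
        return sqrtf(sum / (float)n) > rms_threshold;
    }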

Heavy noise gates over-process the signal. They clip consonants and sibilants, making "s" sounds disappear and "t" sounds muddy. The transcript comes out wrong even though the model faithfully transcribed what reached it.

RNNoise hits a middle ground: aggressive enough to matter, gentle enough to preserve detail.

The pipeline

  1. Audio comes in at 48 kHz
  2. RNNoise detects speech regions and suppresses background
  3. We segment into phrases with padding to preserve context
  4. Audio conditioning (LUFS normalization, high-pass filter)
  5. Resample to 16 kHz for the model
  6. Transcribe

Steps 2 and 3 are where most of the "why does this sound cleaner" happens.
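
To make step 3 concrete, here is a sketch of phrase segmentation with padding: a small ring buffer of recent frames is prepended when speech starts, and a fixed tail is kept after it ends, so word onsets and trailing consonants survive the cut. The padding lengths and the size cap are assumptions, not our shipped values.

    #include <string.h>

    #define FRAME_SIZE 480   /* 10 ms at 48 kHz */
    #define PRE_PAD    20    /* assumed: 200 ms of context before speech onset */
    #define POST_PAD   30    /* assumed: 300 ms of tail after speech ends */
    #define MAX_FRAMES 3000  /* assumed cap: 30 s per phrase */

    typedef struct {
        float pre[PRE_PAD][FRAME_SIZE];     /* ring buffer of recent frames */
        int   pre_head, pre_count;
        float seg[MAX_FRAMES][FRAME_SIZE];  /* the phrase being assembled */
        int   seg_len;
        int   in_speech, silence_run;
    } Segmenter;

    /* Feed one denoised frame plus the VAD decision for it.
     * Returns 1 when a complete padded phrase is ready in seg[0..seg_len). */
    int segmenter_push(Segmenter *s, const float *frame, int speech) {
        if (!s->in_speech) {
            /* Keep a rolling pre-roll so the phrase can start slightly early. */
            memcpy(s->pre[s->pre_head], frame, sizeof(s->pre[0]));
            s->pre_head = (s->pre_head + 1) % PRE_PAD;
            if (s->pre_count < PRE_PAD) s->pre_count++;

            if (speech) {
                /* Speech just started: copy the pre-roll in, oldest frame first. */
                s->seg_len = 0;
                for (int i = 0; i < s->pre_count; i++) {
                    int idx = (s->pre_head + PRE_PAD - s->pre_count + i) % PRE_PAD;
                    memcpy(s->seg[s->seg_len++], s->pre[idx], sizeof(s->seg[0]));
                }
                s->in_speech = 1;
                s->silence_run = 0;
            }
            return 0;
        }

        if (s->seg_len < MAX_FRAMES)
            memcpy(s->seg[s->seg_len++], frame, sizeof(s->seg[0]));

        s->silence_run = speech ? 0 : s->silence_run + 1;
        if (s->silence_run >= POST_PAD) {
            /* Enough trailing padding captured: hand the phrase downstream. */
            s->in_speech = 0;
            s->pre_count = 0;
            return 1;
        }
        return 0;
    }

Zero-initialize a Segmenter (it's a few megabytes, so heap or static storage), call segmenter_push once per frame with the VAD's decision, and pass each completed segment on to conditioning and resampling.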


If you've used dictation in a noisy room and gotten garbage, the problem was probably upstream of the model. We fix it there.