Why we use RNNoise for voice activity detection

Trimming silence and suppressing noise before recognition, not after.

28 Nov 2025

The recognizer doesn't need to hear your keyboard. Or the air conditioner. Or the three seconds of silence while you think.

We use RNNoise for voice activity detection. It's a small recurrent neural network trained on noisy speech that does two things well: it estimates, frame by frame, how likely it is that you're actually talking, and it suppresses background noise without destroying consonants.
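
For a sense of how little machinery this involves, here is a minimal sketch against RNNoise's public C API. The frame size and the three calls are the library's own; the surrounding wiring is illustrative, not our capture code.

    /* Minimal sketch of the RNNoise C API (Xiph.Org's rnnoise library). */
    #include <rnnoise.h>

    #define FRAME_SIZE 480  /* RNNoise works on 10 ms frames of 48 kHz audio */

    int main(void) {
        DenoiseState *st = rnnoise_create(NULL);  /* NULL selects the built-in model */
        float frame[FRAME_SIZE] = {0};            /* samples as floats in 16-bit range */

        /* Denoises the frame in place and returns the probability (0..1)
         * that it contains speech: both halves of the job in one call. */
        float speech_prob = rnnoise_process_frame(st, frame, frame);
        (void)speech_prob;

        rnnoise_destroy(st);
        return 0;
    }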

Why this matters for dictation

Most accuracy problems aren't model problems. They're input problems. Feed the recognizer clean speech segments, and it does better work. Feed it room noise and silence, and it hallucinates words or drops them.

RNNoise runs continuously while you hold the hotkey. It tracks speech vs non-speech in real time and gently pushes down ambient noise. The result: cleaner segments reach the Whisper-style model.
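
In practice, that per-frame speech probability has to become a yes/no decision, and a short hangover keeps brief pauses from chopping words apart. A minimal sketch, with assumed values for the threshold and hangover length:

    /* Turning the per-frame speech probability into a speech/non-speech decision.
     * The threshold and hangover length are assumptions, not our tuned values. */
    #define VAD_THRESHOLD   0.6f  /* assumed: above this counts as speech */
    #define HANGOVER_FRAMES 30    /* assumed: stay "in speech" ~300 ms after the last hit */

    typedef struct { int hangover; } VadState;

    int is_speech(VadState *s, float speech_prob) {
        if (speech_prob > VAD_THRESHOLD)
            s->hangover = HANGOVER_FRAMES;  /* refresh on every speechy frame */
        else if (s->hangover > 0)
            s->hangover--;                  /* coast through brief pauses */
        return s->hangover > 0;
    }

Call it once per 10 ms frame with the probability returned by rnnoise_process_frame.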

What we tried before

Simple energy-based VAD (a volume threshold) fails whenever the noise floor moves around. Loud typing triggers it. Quiet speech doesn't. We needed something that understood speech patterns, not just amplitude.
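
For comparison, an energy gate is nothing more than a per-frame loudness check, along the lines of the sketch below (threshold arbitrary). A keystroke transient clears it easily; quiet speech often doesn't.

    /* The shape of a volume-threshold VAD: it measures loudness, not speechiness. */
    #include <math.h>

    int energy_vad(const float *frame, int n, float rms_threshold) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += frame[i] * frame[i];
        return sqrtf(sum / (float)n) > rms_threshold;
    }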

Heavy noise gates over-process the signal. They clip consonants and sibilants, making "s" sounds disappear and "t" sounds muddy. The transcript comes out wrong even though the model faithfully transcribed what reached it.

RNNoise hits a middle ground: aggressive enough to matter, gentle enough to preserve detail.

The pipeline

  1. Audio comes in at 48 kHz
  2. RNNoise detects speech regions and suppresses background
  3. We segment into phrases with padding to preserve context
  4. Audio conditioning (LUFS normalization, high-pass filter)
  5. Resample to 16 kHz for the model
  6. Transcribe

Steps 2 and 3 are where most of the "why does this sound cleaner" happens.
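
To make step 3 concrete, here is a sketch of phrase segmentation with padding: a small ring buffer of recent frames is prepended when speech starts, and a fixed tail is kept after it ends, so word onsets and trailing consonants survive the cut. The padding lengths and the size cap are assumptions, not our shipped values.

    #include <string.h>

    #define FRAME_SIZE 480   /* 10 ms at 48 kHz */
    #define PRE_PAD    20    /* assumed: 200 ms of context before speech onset */
    #define POST_PAD   30    /* assumed: 300 ms of tail after speech ends */
    #define MAX_FRAMES 3000  /* assumed cap: 30 s per phrase */

    typedef struct {
        float pre[PRE_PAD][FRAME_SIZE];     /* ring buffer of recent frames */
        int   pre_head, pre_count;
        float seg[MAX_FRAMES][FRAME_SIZE];  /* the phrase being assembled */
        int   seg_len;
        int   in_speech, silence_run;
    } Segmenter;

    /* Feed one denoised frame plus the VAD decision for it.
     * Returns 1 when a complete padded phrase is ready in seg[0..seg_len). */
    int segmenter_push(Segmenter *s, const float *frame, int speech) {
        if (!s->in_speech) {
            /* Keep a rolling pre-roll so the phrase can start slightly early. */
            memcpy(s->pre[s->pre_head], frame, sizeof(s->pre[0]));
            s->pre_head = (s->pre_head + 1) % PRE_PAD;
            if (s->pre_count < PRE_PAD) s->pre_count++;

            if (speech) {
                /* Speech just started: copy the pre-roll in, oldest frame first. */
                s->seg_len = 0;
                for (int i = 0; i < s->pre_count; i++) {
                    int idx = (s->pre_head + PRE_PAD - s->pre_count + i) % PRE_PAD;
                    memcpy(s->seg[s->seg_len++], s->pre[idx], sizeof(s->seg[0]));
                }
                s->in_speech = 1;
                s->silence_run = 0;
            }
            return 0;
        }

        if (s->seg_len < MAX_FRAMES)
            memcpy(s->seg[s->seg_len++], frame, sizeof(s->seg[0]));

        s->silence_run = speech ? 0 : s->silence_run + 1;
        if (s->silence_run >= POST_PAD) {
            /* Enough trailing padding captured: hand the phrase downstream. */
            s->in_speech = 0;
            s->pre_count = 0;
            return 1;
        }
        return 0;
    }

Zero-initialize a Segmenter (it's a few megabytes, so heap or static storage), call segmenter_push once per frame with the VAD's decision, and pass each completed segment on to conditioning and resampling.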


If you've used dictation in a noisy room and gotten garbage, the problem was probably upstream of the model. We fix it there.