How the listener hears

Fig 3. The listener watches the audio stream live and refines its guess as it goes; the brain only sees the locked-in final.

Streaming, not batched

The naive way to transcribe a phone call is to wait until the caller finishes speaking, send the whole audio chunk to a transcription service, get the text back, and then start thinking. That works, but the round trip is slow — by the time the caller hears anything, the conversation feels like a phone tree.

The listener works the other way around. As the caller starts talking, the audio is sent in tiny chunks to a transcription service that produces guesses on the fly. The text starts appearing within a fraction of a second of the first word.
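To make "tiny chunks" concrete, here is a minimal sketch of slicing raw call audio into short frames before sending it on. The numbers are assumptions, not the system's actual settings: 8 kHz, 16-bit mono audio (common on phone lines) and 20 ms frames. The function name `frames` is made up for illustration.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 8000    # assumed: narrowband telephone audio
BYTES_PER_SAMPLE = 2     # assumed: 16-bit mono PCM
FRAME_MS = 20            # assumed chunk length; real services vary


def frames(pcm: bytes, frame_ms: int = FRAME_MS) -> Iterator[bytes]:
    """Yield fixed-size chunks of raw PCM; the last chunk may be shorter."""
    frame_bytes = SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


# One second of (silent) audio becomes 50 chunks of 320 bytes each.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(frames(one_second))
```

Each chunk would go out to the transcription service as soon as it is cut, rather than waiting for the caller to finish.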

Why partial guesses matter

Each partial guess is the listener’s best attempt at what’s been said so far — subject to revision. Consider a caller saying “what time do you close”:

The first guess might be just “what.”
A moment later, “what time.”
Then “what time do you...”
Then “what time do you close.”

Each new guess is sharper than the last because the listener has more audio to work with. The listener doesn’t hand any of these to the brain — they’re drafts. The handoff happens later.
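The "drafts stay with the listener" idea can be sketched in a few lines. This is an illustration under assumptions, not the real listener: the class `DraftTranscript` and its method names are hypothetical, and it assumes each partial guess covers the whole utterance so far, so a new guess simply replaces the old one.

```python
from dataclasses import dataclass


@dataclass
class DraftTranscript:
    """Holds the newest partial guess; nothing leaves until finalize()."""
    latest: str = ""

    def on_partial(self, text: str) -> None:
        # A new guess replaces the previous one rather than being appended,
        # because it already contains everything heard so far.
        self.latest = text

    def finalize(self) -> str:
        # Lock in the latest guess and reset for the next utterance.
        final, self.latest = self.latest, ""
        return final


draft = DraftTranscript()
for guess in ["what", "what time", "what time do you", "what time do you close"]:
    draft.on_partial(guess)

final_text = draft.finalize()  # the only text the brain would ever see
```

The brain is never called inside `on_partial`; only the value returned by `finalize` is handed over.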

Knowing when the caller has stopped

The trickiest part of voice is knowing when it’s your turn to speak. People pause mid-sentence to think. They say “um.” They take a breath. The listener has to tell the difference between “I’m thinking” and “I’m done.”

It does this by watching the audio for a short stretch of silence (a few hundred milliseconds, tunable per caller). When the silence is long enough to count as a real pause, the listener locks in the latest guess as final and hands the text to the brain. From here on, the brain has to be fast.
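The silence rule above can be sketched as a small counter. Everything specific here is an assumption for illustration: the class name `Endpointer`, the 400 ms threshold, the 20 ms frame length, and the idea that some upstream voice detector has already labeled each frame as speech or not.

```python
class Endpointer:
    """Signals end-of-turn after enough consecutive silent frames follow speech."""

    def __init__(self, silence_ms: int = 400, frame_ms: int = 20):
        # Both defaults are illustrative; the text only says
        # "a few hundred milliseconds".
        self.frames_needed = silence_ms // frame_ms
        self.silent_run = 0
        self.heard_speech = False

    def feed(self, is_speech: bool) -> bool:
        """Feed one frame's label; return True exactly when the turn ends."""
        if is_speech:
            self.heard_speech = True
            self.silent_run = 0      # any speech resets the silence clock
            return False
        if not self.heard_speech:
            return False             # silence before the caller has said anything
        self.silent_run += 1
        if self.silent_run >= self.frames_needed:
            self.heard_speech = False
            self.silent_run = 0
            return True              # long enough: lock in the final guess
        return False


ep = Endpointer(silence_ms=400, frame_ms=20)
# Speech, a 100 ms "thinking" pause, more speech, then a long silence.
labels = [True] * 10 + [False] * 5 + [True] * 3 + [False] * 25
fired_at = [i for i, label in enumerate(labels) if ep.feed(label)]
```

The short pause in the middle never reaches the threshold, so the turn ends only once, during the long silence at the end. Raising `silence_ms` makes the bot more patient with slow talkers at the cost of a slower reply.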

What about background noise?

The transcription service is trained to filter out sounds that aren’t a human voice: traffic, fans, music, dogs. It also copes with the range of accents, microphones, and line quality you’d expect from a real-world phone call. It’s not perfect, but it’s good enough for the kind of short questions a caller actually asks.

What about multiple languages?

You tell the system which language to expect, or give it a short list of likely languages and let it pick. In mixed-language regions like the Philippines, callers often switch languages mid-sentence; the listener handles that within reason, so the bot will understand the question even when half of it is in Tagalog. The reply, however, comes back only in the voice you’ve set for the speaker (typically a Singapore-English voice for the Philippines), so the agent answers in English even when the caller code-switches. If that’s a deal-breaker for your audience, the brain falls back to a polite “let me transfer you to someone who can help.”
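As a rough sketch of how those choices might be written down, assuming a settings object that the real system may not have: `LanguageSettings`, its field names, and `next_action` are all hypothetical, and `"en"` and `"tl"` are the standard two-letter codes for English and Tagalog.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LanguageSettings:
    # All field names are hypothetical, chosen for illustration.
    expected_languages: Tuple[str, ...]  # languages the listener should expect
    reply_language: str                  # language of the configured speaker voice
    transfer_on_mismatch: bool = False   # offer a human instead of replying


def next_action(settings: LanguageSettings, caller_language: str) -> str:
    """Decide whether the bot replies itself or offers a transfer."""
    if caller_language == settings.reply_language:
        return "reply"
    if settings.transfer_on_mismatch:
        return "transfer"
    # Default described in the text: answer in the configured voice anyway.
    return "reply"


philippines = LanguageSettings(expected_languages=("en", "tl"), reply_language="en")
strict = LanguageSettings(("en", "tl"), "en", transfer_on_mismatch=True)
```

With `philippines`, a Tagalog-heavy question still gets an English answer; with `strict`, the same question leads to the polite transfer.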

In plain words

The listener never makes the brain wait. It transcribes as the caller talks, refines its guess until the caller pauses, then hands the final version off in less than a tenth of a second. From here, the clock is on the brain to reply — and that’s the next post.
