Part 3 of 7 · Voice agent series ~5 min read

How the listener hears

The listener’s job is to turn the caller’s voice into text the brain can read. It does this while the caller is still talking, not after they finish: partial guesses get refined every fraction of a second, and when the caller pauses, the listener locks in the final version.

[Figure: streaming transcription flow. Caller speaking (audio streams in) → Streaming transcriber (guesses what’s being said, refines every fraction of a second) → Caller pauses (transcriber locks the final version) → Forward to Brain (final utterance text). Side panel, live guesses refining while the caller talks: “what...”, “what time...”, “what time do you...”, “what time do you close”.]
Fig 3. The listener watches the audio stream live and refines its guess as it goes; the brain only sees the locked-in final.

Streaming, not batched

The naive way to transcribe a phone call is to wait until the caller finishes speaking, send the whole audio chunk to a transcription service, get the text back, and then start thinking. That works, but the round trip is slow — by the time the caller hears anything, the conversation feels like a phone tree.

The listener works the other way around. As the caller starts talking, the audio is sent in tiny chunks to a transcription service that produces guesses on the fly. The text starts appearing within a fraction of a second of the first word.
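To make “tiny chunks” concrete, here is a minimal sketch of how a raw audio buffer might be sliced before it is sent. The sample rate, sample width, and 20 ms chunk length are assumptions chosen for illustration (they are common for phone audio), and `chunk_audio` is a hypothetical helper, not part of any particular transcription service.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 8000   # assumed: narrowband telephone audio
BYTES_PER_SAMPLE = 2    # assumed: 16-bit mono PCM
CHUNK_MS = 20           # assumed: length of each chunk sent to the transcriber


def chunk_audio(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield fixed-size slices of raw PCM audio, in order."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silent audio stands in for the caller's voice.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(chunk_audio(one_second))
print(len(chunks), len(chunks[0]))  # 50 chunks of 320 bytes each
```

In a live call the chunks would come off the phone line as they arrive rather than from a buffer, but the idea is the same: the transcriber starts working on the first 20 ms instead of waiting for the whole sentence.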

Why partial guesses matter

Each partial guess is the listener’s best attempt at what’s been said so far — subject to revision. Consider a caller saying “what time do you close”:

  • The first guess might be just “what.”
  • A moment later, “what time.”
  • Then “what time do you...”
  • Then “what time do you close.”

Each new guess is sharper than the last because the listener has more audio to work with. The listener doesn’t hand any of these to the brain — they’re drafts. The handoff happens later.
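The “drafts stay with the listener” rule can be sketched in a few lines. Everything here is hypothetical: `TranscriptEvent`, its `is_final` flag, and the `Listener` class are names invented for illustration, and real transcription services shape their events differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TranscriptEvent:
    text: str
    is_final: bool  # assumed flag: True only for the locked-in version


@dataclass
class Listener:
    latest_draft: Optional[str] = None
    forwarded: List[str] = field(default_factory=list)  # what the brain sees

    def on_event(self, event: TranscriptEvent) -> None:
        if event.is_final:
            # Only the locked-in text is handed to the brain.
            self.forwarded.append(event.text)
            self.latest_draft = None
        else:
            # A newer draft simply replaces the previous one.
            self.latest_draft = event.text


listener = Listener()
for event in [
    TranscriptEvent("what", False),
    TranscriptEvent("what time", False),
    TranscriptEvent("what time do you", False),
    TranscriptEvent("what time do you close", True),
]:
    listener.on_event(event)

print(listener.forwarded)  # ['what time do you close']
```

Three drafts came and went, and the brain received exactly one message.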

Knowing when the caller has stopped

The trickiest part of voice is knowing when it’s your turn to speak. People pause mid-sentence to think. They say “um.” They take a breath. The listener has to tell the difference between “I’m thinking” and “I’m done.”

The way it does this: it watches the audio for a short stretch of silence (a few hundred milliseconds — tunable per caller). When the silence is long enough to count as a real pause, the listener locks in the latest guess as final and hands the text to the brain. From here on, the brain has to be fast.
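Here is a minimal sketch of that silence check. It assumes something upstream already labels each 20 ms frame as speech or not (a voice activity detector), and the 500 ms threshold is an illustrative value, not the setting of any real system. `PauseDetector` is a hypothetical name.

```python
class PauseDetector:
    """Signals the end of a turn after enough continuous silence."""

    def __init__(self, silence_ms_needed: int = 500, frame_ms: int = 20) -> None:
        self.silence_ms_needed = silence_ms_needed  # assumed, tunable threshold
        self.frame_ms = frame_ms                    # assumed frame length
        self.heard_speech = False
        self.silent_ms = 0

    def feed(self, is_speech: bool) -> bool:
        """Feed one frame; return True when the caller has really paused."""
        if is_speech:
            self.heard_speech = True
            self.silent_ms = 0  # any speech resets the silence counter
            return False
        if not self.heard_speech:
            return False        # silence before the caller has said anything
        self.silent_ms += self.frame_ms
        if self.silent_ms >= self.silence_ms_needed:
            self.heard_speech = False
            self.silent_ms = 0
            return True
        return False


# Speech, a 100 ms breath, more speech, then 500 ms of silence.
frames = [True] * 10 + [False] * 5 + [True] * 10 + [False] * 25
detector = PauseDetector()
endpoints = [i for i, is_speech in enumerate(frames) if detector.feed(is_speech)]
print(endpoints)  # [49]
```

The short breath in the middle never reaches the threshold, so the listener keeps waiting. Only the long silence at the end counts as “I’m done.” A lower threshold makes the agent feel snappier but cuts off more people mid-thought, which is why the value is worth tuning.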

What about background noise?

The transcription service is trained to filter out most sounds that aren’t a human voice: traffic, fans, music, dogs. It also copes with the range of accents, microphones, and line quality you’d expect on real-world phone calls. It’s not perfect, but it’s good enough for the kind of short questions a caller actually asks.

What about multiple languages?

You tell the system which language to expect (or list a small set of likely languages, and it picks). In mixed-language regions like the Philippines, callers often switch between languages mid-sentence. The listener can handle that within reason: the bot will understand the question even when half of it is in Tagalog. The reply, however, comes back only in whichever voice you’ve set for the speaker (typically a Singapore-English voice for the Philippines), so the agent answers in English even when the caller code-switches. If that’s a deal-breaker for your audience, the brain falls back to a polite “let me transfer you to someone who can help.”

In plain words

The listener never makes the brain wait. It transcribes as the caller talks, refines its guess until the caller pauses, then hands the final version off in less than a tenth of a second. From here, the clock is on the brain to reply — and that’s the next post.
