Part 5 of 7 · Voice agent series ~5 min read

How the speaker stays natural

Voice is unforgiving. A two-second pause feels like a lifetime, and a stilted reply makes the caller hang up. The speaker has a budget for every step, and a streaming voice that starts talking before the brain is done.

[Figure: a horizontal stacked bar carving the one-second budget, from caller pauses to bot speaks, into five segments: listener locks the final transcript (100 ms); brain reads the utterance, picks a tool, writes the reply (400 ms); speaker produces its first chunk of audio (200 ms); network plus audio buffer gets the chunk to the caller's phone and starts it playing (150 ms); spare to absorb unusual delays (150 ms). Best case is ~850 ms; real-world traffic typically lands around 1.0–1.2 seconds once cold-start protection and network jitter are included. If the brain isn't ready in time, the system stalls naturally with "one moment" rather than going silent.]
Fig 5. The latency budget. Each step has a target; the spare at the end soaks up the inevitable jitter.

Why latency is the design problem

For text chat, a two-second reply is fine. Email is fine if you reply in an hour. Voice is different: a two-second pause feels like a lifetime to the caller. They’ll start to repeat themselves, ask “hello?”, or hang up.

The threshold for “feels like a real conversation” is around one second from when the caller stops talking to when they hear something back. Above that, the conversation breaks. The whole pipeline is designed around staying inside that one-second window on every reply.

The budget

One second sounds like plenty until you start carving it up:

  • Listener locks final — about a tenth of a second after the caller pauses, the listener locks in the transcript and hands it to the brain. (Most of this is the listener double-checking that the caller really stopped, rather than just taking a breath.)
  • Brain decides — the AI reads what the caller said, picks a tool, writes the reply. About four-tenths of a second.
  • Speaker first byte — the speaker turns the first chunk of text into audio. About two-tenths of a second.
  • Network and audio buffer — the audio chunk travels to the caller’s phone and starts playing. About a tenth and a half of a second.

That’s 850 milliseconds, with 150 milliseconds of spare for when something is slower than usual. The budget is tight on purpose — if you let any one step take longer, another step has to make it up.

In practice, real traffic lands around 1 to 1.2 seconds once you account for the cloud waking up, network hiccups, and the occasional long question from the caller. That’s still inside the “feels like a real conversation” threshold for most callers — but only just. The diagram above is the target, not the average.
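
To make that concrete, here's a minimal sketch of a per-turn clock, the kind of thing that enforces a budget like this. The step names and numbers mirror the list above; everything else (TurnClock and its methods) is made up for illustration, not lifted from a real library.

```python
import time

# Per-step targets in milliseconds, matching the budget above.
# The names are invented for this sketch.
BUDGET_MS = {
    "listener_final": 100,      # lock the final transcript
    "brain_decides": 400,       # read utterance, pick tool, write reply
    "speaker_first_byte": 200,  # first chunk of synthesized audio
    "network_buffer": 150,      # deliver the chunk and start playback
}
TOTAL_BUDGET_MS = 1000          # the "feels like a real conversation" line

class TurnClock:
    """Tracks one bot reply against the budget, step by step."""

    def __init__(self):
        self.start = time.monotonic()
        self.spent_ms = {}

    def mark(self, step):
        """Record the step that just finished; True if it met its target."""
        elapsed = (time.monotonic() - self.start) * 1000
        self.spent_ms[step] = elapsed - sum(self.spent_ms.values())
        return self.spent_ms[step] <= BUDGET_MS.get(step, 0)

    def remaining_ms(self):
        """How much of the one-second budget is left, spare included."""
        return TOTAL_BUDGET_MS - (time.monotonic() - self.start) * 1000
```

A clock like this is what lets the pipeline decide mid-turn whether there's still time to wait quietly for the brain, or whether it's time for the "one moment" filler covered below.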

Streaming the voice back

The most important trick: the speaker doesn’t wait for the brain to finish writing the entire reply. The moment the brain writes the first sentence, the speaker starts synthesising it — while the brain is still thinking about the second sentence.

From the caller’s side, the bot starts talking almost immediately, and the rest of the reply just keeps flowing naturally. The actual synthesis and the brain’s thinking overlap in time.
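
Here's a minimal sketch of that overlap, assuming the brain exposes its reply as a stream of sentences. Both brain_reply and speak are hypothetical stand-ins for the streaming model call and the text-to-speech call:

```python
import asyncio

async def brain_reply(utterance):
    """Hypothetical stand-in for a streaming model call: yields the
    reply one sentence at a time, as the model writes it."""
    for sentence in ["Sure, let me check that.",
                     "You're booked for Tuesday at three."]:
        await asyncio.sleep(0.4)  # the model still "thinking"
        yield sentence

async def speak(sentence):
    """Hypothetical stand-in for text-to-speech: synthesizes one
    sentence and streams its audio to the caller."""
    await asyncio.sleep(0.2)  # synthesis plus hand-off to the line
    print("caller hears:", sentence)

async def reply(utterance):
    # The key move: say sentence N while the brain writes sentence N+1.
    # The caller hears audio long before the full reply exists.
    speaking = None
    async for sentence in brain_reply(utterance):
        if speaking:
            await speaking  # finish saying the previous sentence first
        speaking = asyncio.create_task(speak(sentence))
    if speaking:
        await speaking

asyncio.run(reply("Can I book Tuesday?"))
```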

Sounding like a person, not a phone tree

The voice itself matters. Old-school text-to-speech sounds robotic and formal — the kind of voice that makes callers say “agent” in frustration. Modern voices, properly chosen, can sound conversational, with natural pauses and inflection. The speaker uses the modern kind — it costs slightly more per character, but it’s the difference between a real conversation and a phone tree.

You also get to pick a voice that matches your business. A clinic might want a calm, warm voice. A restaurant might want a friendly, energetic one. The voice is part of the brand, the same way your logo is.

When the caller interrupts

Sometimes the caller starts talking while the bot is still mid-reply. Real conversations do this all the time. The pipeline detects new caller audio, stops the speaker, and hands the new words to the brain.

This is harder than it sounds — you have to tell the difference between “the caller cut me off” and “the caller said ‘mhm’ in agreement.” The listener watches for substance, not just sound. Short acknowledgements pass through; new questions stop the bot.
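
A sketch of that filter might look like the following; the acknowledgement list and the function name are illustrative, and a real system would tune both per language and per use case:

```python
# An illustrative backchannel list; real systems tune this per language.
ACKNOWLEDGEMENTS = {"mhm", "mm-hmm", "uh-huh", "yeah", "yep",
                    "ok", "okay", "right", "sure"}

def is_real_interruption(interim_transcript: str) -> bool:
    """Should caller audio heard mid-reply stop the speaker?

    Short acknowledgements pass through; anything with substance
    stops the bot and goes to the brain as a new utterance.
    """
    words = [w.strip(".,!?") for w in interim_transcript.lower().split()]
    words = [w for w in words if w]
    if not words:
        return False  # a breath or line noise: keep talking
    if len(words) <= 2 and all(w in ACKNOWLEDGEMENTS for w in words):
        return False  # "mhm", "okay yeah": agreement, keep talking
    return True       # new content: stop the speaker, wake the brain

# is_real_interruption("mhm")                        -> False
# is_real_interruption("wait, can we do Wednesday?") -> True
```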

When the budget blows

Sometimes the brain takes too long — the question is unusual, the AI service is slow, the network has a hiccup. If the budget is about to blow, the system stalls gracefully with a short filler (“one moment, let me check that for you”) instead of going silent. The caller hears the system thinking, not the system frozen.

That filler buys another two seconds, which is usually enough. If the reply still isn’t ready even then, the brain hands the call to a human.
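
One way to wire that up, as a sketch: wait quietly up to a deadline, play the filler, then wait once more before escalating. Here play_filler and transfer_to_human are hypothetical helpers, and brain_task is assumed to be an asyncio task already running:

```python
import asyncio

SILENT_WAIT_S = 0.8   # how long we can wait before the pause feels dead
FILLER_BUYS_S = 2.0   # roughly what "one moment" buys us

async def reply_or_stall(brain_task, play_filler, transfer_to_human):
    """Wait for the brain's reply; stall with a filler rather than go silent."""
    try:
        # Happy path: the brain answers inside the budget.
        return await asyncio.wait_for(asyncio.shield(brain_task), SILENT_WAIT_S)
    except asyncio.TimeoutError:
        # Budget about to blow: let the caller hear us thinking, not a
        # dead line. shield() keeps the brain working through the timeout.
        await play_filler("One moment, let me check that for you.")
    try:
        # The filler bought us a couple more seconds.
        return await asyncio.wait_for(asyncio.shield(brain_task), FILLER_BUYS_S)
    except asyncio.TimeoutError:
        # Even the filler wasn't enough; hand the call to a person.
        await transfer_to_human()
```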

In plain words

Voice is harder than text because the caller is on the other end with no patience and no second screen. The speaker’s job isn’t just to read a reply — it’s to do it fast enough that the conversation feels natural. Every step has a budget; every step starts before the previous one is fully done. That’s the only way to fit a real conversation in under a second.
