Part 5 of 7 · Voice agent series ~5 min read

How the speaker stays natural

Voice is unforgiving. A two-second pause feels like a lifetime, and a stilted reply makes the caller hang up. The speaker has a budget for every step, and a streaming voice that starts talking before the brain is done.

[Figure: a horizontal stacked bar carving the one-second budget, from caller pauses to bot speaks, into five segments: listener locks the final transcript (100 ms); brain reads the utterance, picks a tool, writes the reply (400 ms); speaker produces its first chunk of audio (200 ms); network plus audio buffer gets the chunk to the caller's phone and starts it playing (150 ms); spare to absorb unusual delays (150 ms). Best case is ~850 ms; real-world traffic typically lands around 1.0–1.2 seconds once cold-start protection and network jitter are included. If the brain isn't ready in time, the system stalls naturally with "one moment" rather than going silent.]
Fig 5. The latency budget. Each step has a target; the spare at the end soaks up the inevitable jitter.

Why latency is the design problem

For text chat, a two-second reply is fine. Email is fine if you reply in an hour. Voice is different: a two-second pause feels like a lifetime to the caller. They’ll start to repeat themselves, ask “hello?”, or hang up.

The threshold for “feels like a real conversation” is around one second from when the caller stops talking to when they hear something back. Above that, the conversation breaks. The whole pipeline is designed around staying inside that one-second window on every reply.

The budget

One second sounds like plenty until you start carving it up:

  • Listener locks final — about a tenth of a second after the caller pauses, the listener locks in the transcript and hands it to the brain. (Most of this is the listener double-checking that the caller really stopped, rather than just taking a breath.)
  • Brain decides — the AI reads what the caller said, picks a tool, writes the reply. About four-tenths of a second.
  • Speaker first byte — the speaker turns the first chunk of text into audio. About two-tenths of a second.
  • Network and audio buffer — the audio chunk travels to the caller’s phone and starts playing. About a tenth and a half of a second.

That’s 850 milliseconds, with 150 milliseconds of spare for when something is slower than usual. The budget is tight on purpose — if you let any one step take longer, another step has to make it up.

In practice, real traffic lands around 1 to 1.2 seconds once you account for the cloud waking up, network hiccups, and the occasional long question from the caller. That’s still inside the “feels like a real conversation” threshold for most callers — but only just. The diagram above is the target, not the average.
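
To make that concrete, here's a minimal sketch of a per-turn clock, the kind of thing that enforces a budget like this. The step names and numbers mirror the list above; everything else (TurnClock and its methods) is made up for illustration, not lifted from a real library.

```python
import time

# Per-step targets in milliseconds, matching the budget above.
# The names are invented for this sketch.
BUDGET_MS = {
    "listener_final": 100,      # lock the final transcript
    "brain_decides": 400,       # read utterance, pick tool, write reply
    "speaker_first_byte": 200,  # first chunk of synthesized audio
    "network_buffer": 150,      # deliver the chunk and start playback
}
TOTAL_BUDGET_MS = 1000          # the "feels like a real conversation" line

class TurnClock:
    """Tracks one bot reply against the budget, step by step."""

    def __init__(self):
        self.start = time.monotonic()
        self.spent_ms = {}

    def mark(self, step):
        """Record the step that just finished; True if it met its target."""
        elapsed = (time.monotonic() - self.start) * 1000
        self.spent_ms[step] = elapsed - sum(self.spent_ms.values())
        return self.spent_ms[step] <= BUDGET_MS.get(step, 0)

    def remaining_ms(self):
        """How much of the one-second budget is left, spare included."""
        return TOTAL_BUDGET_MS - (time.monotonic() - self.start) * 1000
```

A clock like this is what lets the pipeline decide mid-turn whether there's still time to wait quietly for the brain, or whether it's time for the "one moment" filler covered below.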

Streaming the voice back

The most important trick: the speaker doesn’t wait for the brain to finish writing the entire reply. The moment the brain writes the first sentence, the speaker starts synthesising it — while the brain is still thinking about the second sentence.

From the caller’s side, the bot starts talking almost immediately, and the rest of the reply just keeps flowing naturally. The actual synthesis and the brain’s thinking overlap in time.
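
Here's a minimal sketch of that overlap, assuming the brain exposes its reply as a stream of sentences. Both brain_reply and speak are hypothetical stand-ins for the streaming model call and the text-to-speech call:

```python
import asyncio

async def brain_reply(utterance):
    """Hypothetical stand-in for a streaming model call: yields the
    reply one sentence at a time, as the model writes it."""
    for sentence in ["Sure, let me check that.",
                     "You're booked for Tuesday at three."]:
        await asyncio.sleep(0.4)  # the model still "thinking"
        yield sentence

async def speak(sentence):
    """Hypothetical stand-in for text-to-speech: synthesizes one
    sentence and streams its audio to the caller."""
    await asyncio.sleep(0.2)  # synthesis plus hand-off to the line
    print("caller hears:", sentence)

async def reply(utterance):
    # The key move: say sentence N while the brain writes sentence N+1.
    # The caller hears audio long before the full reply exists.
    speaking = None
    async for sentence in brain_reply(utterance):
        if speaking:
            await speaking  # finish saying the previous sentence first
        speaking = asyncio.create_task(speak(sentence))
    if speaking:
        await speaking

asyncio.run(reply("Can I book Tuesday?"))
```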

Sounding like a person, not a phone tree

The voice itself matters. Old-school text-to-speech sounds robotic and formal — the kind of voice that makes callers say “agent” in frustration. Modern voices, properly chosen, can sound conversational, with natural pauses and inflection. The speaker uses the modern kind — it costs slightly more per character, but it’s the difference between a real conversation and a phone tree.

You also get to pick a voice that matches your business. A clinic might want a calm, warm voice. A restaurant might want a friendly, energetic one. The voice is part of the brand, the same way your logo is.

When the caller interrupts

Sometimes the caller starts talking while the bot is still mid-reply. Real conversations do this all the time. The pipeline detects new caller audio, stops the speaker, and hands the new words to the brain.

This is harder than it sounds — you have to tell the difference between “the caller cut me off” and “the caller said ‘mhm’ in agreement.” The listener watches for substance, not just sound. Short acknowledgements pass through; new questions stop the bot.
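
A sketch of that filter might look like the following; the acknowledgement list and the function name are illustrative, and a real system would tune both per language and per use case:

```python
# An illustrative backchannel list; real systems tune this per language.
ACKNOWLEDGEMENTS = {"mhm", "mm-hmm", "uh-huh", "yeah", "yep",
                    "ok", "okay", "right", "sure"}

def is_real_interruption(interim_transcript: str) -> bool:
    """Should caller audio heard mid-reply stop the speaker?

    Short acknowledgements pass through; anything with substance
    stops the bot and goes to the brain as a new utterance.
    """
    words = [w.strip(".,!?") for w in interim_transcript.lower().split()]
    words = [w for w in words if w]
    if not words:
        return False  # a breath or line noise: keep talking
    if len(words) <= 2 and all(w in ACKNOWLEDGEMENTS for w in words):
        return False  # "mhm", "okay yeah": agreement, keep talking
    return True       # new content: stop the speaker, wake the brain

# is_real_interruption("mhm")                        -> False
# is_real_interruption("wait, can we do Wednesday?") -> True
```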

When the budget blows

Sometimes the brain takes too long — the question is unusual, the AI service is slow, the network has a hiccup. If the budget is about to blow, the system stalls gracefully with a short filler (“one moment, let me check that for you”) instead of going silent. The caller hears the system thinking, not the system frozen.

That filler buys another two seconds, which is usually enough. If the reply still isn’t ready even then, the brain hands the call to a human.
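
One way to wire that up, as a sketch: wait quietly up to a deadline, play the filler, then wait once more before escalating. Here play_filler and transfer_to_human are hypothetical helpers, and brain_task is assumed to be an asyncio task already running:

```python
import asyncio

SILENT_WAIT_S = 0.8   # how long we can wait before the pause feels dead
FILLER_BUYS_S = 2.0   # roughly what "one moment" buys us

async def reply_or_stall(brain_task, play_filler, transfer_to_human):
    """Wait for the brain's reply; stall with a filler rather than go silent."""
    try:
        # Happy path: the brain answers inside the budget.
        return await asyncio.wait_for(asyncio.shield(brain_task), SILENT_WAIT_S)
    except asyncio.TimeoutError:
        # Budget about to blow: let the caller hear us thinking, not a
        # dead line. shield() keeps the brain working through the timeout.
        await play_filler("One moment, let me check that for you.")
    try:
        # The filler bought us a couple more seconds.
        return await asyncio.wait_for(asyncio.shield(brain_task), FILLER_BUYS_S)
    except asyncio.TimeoutError:
        # Even the filler wasn't enough; hand the call to a person.
        await transfer_to_human()
```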

In plain words

Voice is harder than text because the caller is on the other end with no patience and no second screen. The speaker’s job isn’t just to read a reply — it’s to do it fast enough that the conversation feels natural. Every step has a budget; every step starts before the previous one is fully done. That’s the only way to fit a real conversation in under a second.
