Voice agent
A serverless voice agent on AWS that answers your business phone, replies from your own knowledge in real time, and politely passes the rest to a human. Seven posts on the same system — one diagram at a time — with an engineering reference at the end.
-
01
A voice agent on AWS for the price of a phone plan
The whole system on one page — a listener, a brain, a speaker, and the under-one-second loop they share.
-
02
How a call connects
Three ways a call can go: voicemail after hours, AI session in business hours, or direct human transfer for VIP numbers.
-
03
How the listener hears
Streaming transcription, partial guesses refined live, locked the moment the caller pauses.
-
04
How the brain decides what to say
Four tools, one pick per turn: answer, book, transfer, end. The AI is allowed to be confident or to defer — never to invent.
-
05
How the speaker stays natural
The latency budget for a one-second-or-less reply. Where each millisecond goes, and what happens when the budget blows.
-
06
What the voice agent costs
Phone-bill territory at SMB volume. The phone number is the floor; everything else scales with how often it rings.
-
07
Engineering reference: the voice agent architecture
Same system, drawn purely for engineers. Service names, resource identifiers, region, Bedrock model IDs.
Frequently asked questions
- What does the voice agent do?
- It answers your business phone, replies from your own knowledge file in real time (hours, services, prices, FAQs), and politely transfers anything beyond its remit to a human. The brain has exactly four tools per turn — answer from the knowledge file, book an appointment, transfer to a human, or end the call gracefully — and never invents a price, hour, or promise.
- How much does it cost to run?
- About $35–$50 per month at typical SMB volume of around 200 call minutes. The phone number is the floor (~$22/month), Secrets Manager runs ~40¢ per secret, and per-minute costs (Connect inbound minutes, Transcribe streaming, Polly synthesis) plus per-call Bedrock Haiku invocations scale with how often the line rings. An AWS Budgets alarm set at $80 per month flags unexpected spend.
- Which phone number provider does it use?
- Amazon Connect. You claim a DID directly on Connect (or port your existing number), and the call lands on a Contact Flow that greets, screens for time-of-day and VIP rules, and routes. Connect Audio Stream pipes both directions of audio through Kinesis Video Streams into the AI session.
- How does the agent avoid making things up?
- Strict tool_use with four tool definitions (answer_from_knowledge, book_appointment, transfer_to_human, end_call) and required parameter schemas, so the model can only emit a structured tool call, never free text. The answer tool reads only from the knowledge file in S3 Vectors. If a fact isn’t in the file or the brain isn’t confident, it transfers rather than guessing.
- What happens for after-hours calls?
- The Contact Flow checks the current time against your business hours from the knowledge file. Outside hours, the caller hears your closed-hours message and gets a chance to leave a voicemail with a callback number — the AI never spins up. You see the voicemail the next morning. After-hours is essentially free.
- How fast does the voice agent reply?
- The target is under one second from when the caller stops talking to when the agent starts speaking, end-to-end. The latency budget is ~100 ms for the listener to lock the final transcript, ~400 ms for the brain to pick a tool and write the reply, ~200 ms for the speaker’s first byte of audio, ~150 ms for network and audio buffer, and ~150 ms spare, which adds up to 1,000 ms. Real traffic typically lands around 1.0–1.2 seconds once cold starts and jitter are included.
- Which AWS services does it use?
- Amazon Connect (DID + Contact Flow), Kinesis Video Streams (two-way audio bridge), Lambda (Python 3.14 for the orchestrator, brain, speaker, booking, config sync, and archive), Amazon Transcribe Streaming, Amazon Bedrock (Claude Haiku 4.5 via Global cross-Region inference) with S3 Vectors for the knowledge index, Amazon Polly Generative voices via the Bidirectional Streaming API, DynamoDB on-demand, S3, EventBridge, SNS, Secrets Manager, CloudWatch Logs with seven-day retention, and AWS Budgets. No always-on compute, no NAT Gateway.
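The four-tool constraint from the FAQ can be sketched as a tool configuration in the shape that Bedrock's Converse API accepts. This is a minimal illustration, not the production definitions: the tool names come from the text above, but every parameter name, description, and the `missing_params` helper are assumptions made for the example.

```python
# Hypothetical tool definitions in the shape of a Bedrock Converse toolConfig.
# Tool names come from the FAQ; parameter names and descriptions are assumed.

def tool(name, description, properties, required):
    """Build one toolSpec entry with a JSON Schema for its input."""
    return {
        "toolSpec": {
            "name": name,
            "description": description,
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": properties,
                    "required": required,
                }
            },
        }
    }

TOOL_CONFIG = {
    "tools": [
        tool("answer_from_knowledge", "Reply using only the knowledge file.",
             {"reply": {"type": "string"}}, ["reply"]),
        tool("book_appointment", "Book a slot for the caller.",
             {"service": {"type": "string"}, "start_time": {"type": "string"}},
             ["service", "start_time"]),
        tool("transfer_to_human", "Hand the call to a person.",
             {"reason": {"type": "string"}}, ["reason"]),
        tool("end_call", "Close the call politely.",
             {"farewell": {"type": "string"}}, ["farewell"]),
    ],
    # "any" asks the model to call one of the tools instead of replying in free text.
    "toolChoice": {"any": {}},
}

def missing_params(tool_name, tool_input):
    """Return required parameters absent from a tool call; raise on unknown tools."""
    for entry in TOOL_CONFIG["tools"]:
        spec = entry["toolSpec"]
        if spec["name"] == tool_name:
            required = spec["inputSchema"]["json"]["required"]
            return [p for p in required if p not in tool_input]
    raise ValueError(f"unknown tool: {tool_name}")
```

A caller could pass `TOOL_CONFIG` as the `toolConfig` argument of a Converse request and run `missing_params` on each returned tool call before acting on it, treating any gap as a reason to transfer.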
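The three call routes (voicemail after hours, AI session in business hours, direct human transfer for VIP numbers) amount to a small decision function. In the real system this logic lives in the Connect Contact Flow, so the sketch below only illustrates the rules. The opening hours, the VIP number, and the order of the checks are assumptions; the source does not say whether VIP callers bypass the after-hours message.

```python
from datetime import datetime, time

# Hypothetical configuration; real values would come from the knowledge file.
OPEN_TIME = time(9, 0)
CLOSE_TIME = time(17, 0)
OPEN_WEEKDAYS = {0, 1, 2, 3, 4}  # Monday through Friday
VIP_NUMBERS = {"+15550100"}

def route_call(caller_number, now):
    """Pick one of the three routes for an inbound call."""
    in_hours = (
        now.weekday() in OPEN_WEEKDAYS
        and OPEN_TIME <= now.time() < CLOSE_TIME
    )
    # Assumption: the after-hours check runs first, so VIPs also reach voicemail.
    if not in_hours:
        return "voicemail"
    if caller_number in VIP_NUMBERS:
        return "human_transfer"
    return "ai_session"
```

Checking hours first keeps the after-hours path free of any AI or transfer cost, which matches the claim that after-hours calls are essentially free.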
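The latency budget quoted in the FAQ is easy to check arithmetically. The sketch below uses the per-stage figures from the text; the stage key names and the `over_budget` helper are hypothetical and exist only for illustration.

```python
# Per-turn latency budget in milliseconds, using the figures from the FAQ.
BUDGET_MS = {
    "listener_final_transcript": 100,
    "brain_tool_and_reply": 400,
    "speaker_first_audio_byte": 200,
    "network_and_audio_buffer": 150,
    "spare": 150,
}

def over_budget(measured_ms):
    """Return stages whose measured time exceeds the budget, with the overrun in ms."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in BUDGET_MS.items()
        if stage != "spare" and measured_ms.get(stage, 0) > limit
    }

total_ms = sum(BUDGET_MS.values())          # whole budget, spare included
working_ms = total_ms - BUDGET_MS["spare"]  # time the stages may actually use
```

Here `total_ms` is 1000 and `working_ms` is 850, so a turn stays under one second only while the combined overruns fit inside the 150 ms of spare.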