Engineering reference: the voice agent architecture
Same system as the rest of the series, drawn purely for engineers. Service names, resource identifiers, region, Bedrock model IDs, and the actual flow operations — everything you’d need to recreate this in your own AWS account.
Key takeaways
- Single AWS account in
ap-southeast-1(Singapore); Bedrock via Global cross-Region inference. - Five subsystems: Build & Deploy, Config Sync, Connect (call entry + audio), Listener + Brain (per turn), Speaker.
- Audio bridge: Amazon Connect ↔ Kinesis Video Streams ↔
fn-call-orchestrator; Polly Bidirectional Streaming API (March 2026) drives the speaker. - Model:
global.anthropic.claude-haiku-4-5-20251001-v1:0with strict tool_use over four tools (answer_from_knowledge,book_appointment,transfer_to_human,end_call); knowledge retrieval viavec-knowledgeon S3 Vectors. - Alternatives noted but off-path: Connect AI agents (less code, less control) and Amazon Nova 2 Sonic (single-call speech-to-speech, not yet in
ap-southeast-1).
Posts 1–6 walk through the system in plain language. This page is the dense version — no softening, just the architecture as you’d sketch it on a whiteboard during a design review.
Read this top-down, then column-by-column
Top row is the three external surfaces. Below it, the AWS account contains five subsystems: Build & Deploy across the top, then Config Sync, then three runtime columns (Connect, Listener+Brain, Speaker), with a Cross-cutting strip at the bottom. The bidirectional audio arrow runs from the caller (top right) all the way down to Connect (bottom left) — carrying caller speech in and bot voice back out on the same channel. The dashed grey arrow from Config Sync to fn-brain shows the knowledge dependency — the brain reads the latest knowledge file from S3 on every turn.
Naming conventions used in the diagram
- Lambda functions:
fn-<purpose>—fn-call-orchestrator,fn-brain,fn-speaker,fn-booking,fn-config-sync,fn-call-archive. - DynamoDB tables:
tbl-call-logs,tbl-audit. - SNS topics:
t-alarmsfor general failures,t-transferred-callsfor human-handoff notifications. - S3 layout: single bucket
voice-agent-datawith prefixesconfig/,recordings/{date}/,archive/. - S3 Vectors index:
vec-knowledge— chunked + embedded knowledge file for top-k retrieval.
Region and Bedrock model access
Everything runs in ap-southeast-1 (Singapore) for low latency from the Philippines. Bedrock model invocations use the Global cross-Region inference profile (model IDs prefixed with global.) — data at rest stays in Singapore; inference may route to other regions for capacity. Pricing is the same as on-demand Singapore pricing.
The brain uses strict tool_use: four tool definitions (answer_from_knowledge, book_appointment, transfer_to_human, end_call) with required parameter schemas, so the model can only emit a structured tool call — not a free-text reply. Free text would let the model invent prices or promises; tool_use makes that mathematically impossible.
What’s deliberately not on the diagram
- IAM policy details — per-Lambda execution role inline policies are minimal (one bucket prefix, one table, one Connect instance as appropriate).
- Per-business knowledge schema — the
knowledge.mdfile is a single Drive doc with sections for hours, services, FAQs, and tone. Updating sections updates the agent’s answers without a deploy. - X-Ray tracing — on for
fn-brainandfn-speaker, sampling 100% during tuning, 10% in steady state. Latency is the design problem; tracing is non-negotiable here. - The CloudFormation parameters for the Bedrock model ID and the Polly voice are templated, so swapping voices or models doesn’t require code changes.
- Connect AI agents — AWS’s built-in path with native
ESCALATIONandHANDOFFtools and managed Bedrock integration. Less code than the custom KVS+Lambda path here, at the cost of less control over tool schemas and retrieval. Worth picking when the four tools and bring-your-own knowledge file aren’t strictly required. - Amazon Nova 2 Sonic — Anthropic-style speech-to-speech via a single Bedrock model, GA December 2025. Collapses the Transcribe + Bedrock + Polly three-stage path into one call when it’s available in your region (currently us-east-1, us-west-2, ap-northeast-1; not yet in ap-southeast-1).
- Live agent escalation via Connect Tasks — the natural follow-up to a transfer. Tasks now ship with AI-powered overviews and recommended next actions, so the human picking up doesn’t walk into a cold call.
If you’re recreating this
Start with Build & Deploy alone (a single Lambda, no triggers). Once git push reliably updates an empty stack, claim a phone number on Connect and get a static greeting playing. Then a contact flow with the time-of-day check. Then the audio bridge into fn-call-orchestrator. Then the Listener + Brain on a single hard-coded tool. Then the Speaker. Then the other three tools. Cross-cutting (audit, logs, alarms, budget, archive) goes in from day one.