Engineering reference: the voice agent architecture
Same system as the rest of the series, drawn purely for engineers. Service names, resource identifiers, region, Bedrock model IDs, and the actual flow operations — everything you’d need to recreate this in your own AWS account.
Posts 1–6 walk through the system in plain language. This page is the dense version — no softening, just the architecture as you’d sketch it on a whiteboard during a design review.
Fig 7. Full architecture, ap-southeast-1. White boxes = AWS resources; dashed outer container = the AWS account; dashed grey boxes = subsystem groupings; dashed grey arrows = config feed and side branches.
Read this top-down, then column-by-column
Top row is the three external surfaces. Below it, the AWS account contains five subsystems: Build & Deploy across the top, then Config Sync, then three runtime columns (Connect, Listener+Brain, Speaker), with a Cross-cutting strip at the bottom. The bidirectional audio arrow runs from the caller (top right) all the way down to Connect (bottom left) — carrying caller speech in and bot voice back out on the same channel. The dashed grey arrow from Config Sync to fn-brain shows the knowledge dependency — the brain reads the latest knowledge file from S3 on every turn.
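A minimal sketch of that per-turn read, with placeholder bucket and key names; the ETag-conditional GET is an assumption on top of the diagram, there to keep the hot path cheap when the doc hasn't changed between turns.

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="ap-southeast-1")

# Placeholder names; the real bucket and key come from the stack config.
KNOWLEDGE_BUCKET = "voice-agent-config"
KNOWLEDGE_KEY = "knowledge/knowledge.md"

_cache = {"etag": None, "body": None}

def load_knowledge() -> str:
    """Per-turn read of the latest knowledge file from S3."""
    kwargs = {"Bucket": KNOWLEDGE_BUCKET, "Key": KNOWLEDGE_KEY}
    if _cache["etag"]:
        kwargs["IfNoneMatch"] = _cache["etag"]  # conditional GET on ETag
    try:
        resp = s3.get_object(**kwargs)
        _cache["etag"] = resp["ETag"]
        _cache["body"] = resp["Body"].read().decode("utf-8")
    except ClientError as e:
        # HTTP 304: the cached copy is still the latest version.
        if e.response["ResponseMetadata"]["HTTPStatusCode"] != 304:
            raise
    return _cache["body"]
```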
Everything runs in ap-southeast-1 (Singapore) for low latency from the Philippines. Bedrock model invocations use the Global cross-Region inference profile (model IDs prefixed with global.) — data at rest stays in Singapore; inference may route to other regions for capacity. Pricing is the same as on-demand Singapore pricing.
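Invoking through the profile looks exactly like a plain model invocation: the profile ID goes in the modelId field. The specific Claude profile below is a placeholder, not a claim about which model this stack runs.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="ap-southeast-1")

# Placeholder profile ID: any global. cross-Region inference profile is
# passed the same way, as the modelId of an ordinary invocation.
MODEL_ID = "global.anthropic.claude-sonnet-4-20250514-v1:0"

resp = bedrock.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user",
               "content": [{"text": "Are you open on Saturdays?"}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(resp["output"]["message"]["content"][0]["text"])
```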
The brain uses strict tool_use: four tool definitions (answer_from_knowledge, book_appointment, transfer_to_human, end_call) with required parameter schemas, so the model can only emit a structured tool call, never a free-text reply. Free text would let the model invent prices or promises; strict tool_use makes that structurally impossible.
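A sketch of that setup on the Converse API, with the four tool names from the flow; the parameter schemas here are illustrative, not the production definitions.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="ap-southeast-1")
MODEL_ID = "global.anthropic.claude-sonnet-4-20250514-v1:0"  # placeholder

def tool(name, description, props, required):
    """Converse-API tool definition with a required-parameter schema."""
    return {"toolSpec": {"name": name, "description": description,
                         "inputSchema": {"json": {"type": "object",
                                                  "properties": props,
                                                  "required": required}}}}

TOOLS = [
    tool("answer_from_knowledge", "Answer using only the knowledge file.",
         {"answer": {"type": "string"}}, ["answer"]),
    tool("book_appointment", "Book an appointment slot.",
         {"caller_name": {"type": "string"},
          "slot_iso8601": {"type": "string"}},
         ["caller_name", "slot_iso8601"]),
    tool("transfer_to_human", "Hand the call to a person.",
         {"reason": {"type": "string"}}, ["reason"]),
    tool("end_call", "Politely end the call.",
         {"closing_line": {"type": "string"}}, ["closing_line"]),
]

resp = bedrock.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user",
               "content": [{"text": "Do you do cleanings on Saturdays?"}]}],
    toolConfig={
        "tools": TOOLS,
        # "any" obliges the model to call at least one tool, so every
        # turn comes back as a structured toolUse block.
        "toolChoice": {"any": {}},
    },
)
call = next(c["toolUse"] for c in resp["output"]["message"]["content"]
            if "toolUse" in c)
# call["name"] is one of the four tools; call["input"] matches its schema.
```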
What’s deliberately not on the diagram
IAM policy details — per-Lambda execution role inline policies are minimal (one bucket prefix, one table, one Connect instance, as appropriate). A sketch of one such policy follows this list.
Per-business knowledge schema — the knowledge.md file is a single Drive doc with sections for hours, services, FAQs, and tone. Updating sections updates the agent’s answers without a deploy.
X-Ray tracing — on for fn-brain and fn-speaker, sampling 100% during tuning and 10% in steady state. Latency is the design problem; tracing is non-negotiable here. Enabling it is one call per function, sketched after this list.
CloudFormation parameters — the Bedrock model ID and the Polly voice are template parameters, so swapping voices or models doesn't require code changes. See the environment-variable sketch after this list.
Connect AI agents — AWS’s built-in path with native ESCALATION and HANDOFF tools and managed Bedrock integration. Less code than the custom KVS+Lambda path here, at the cost of less control over tool schemas and retrieval. Worth picking when the four tools and bring-your-own knowledge file aren’t strictly required.
Amazon Nova 2 Sonic — speech-to-speech through a single Bedrock model, GA December 2025. Collapses the Transcribe + Bedrock + Polly three-stage path into one call once it's available in your region (currently us-east-1, us-west-2, ap-northeast-1; not yet in ap-southeast-1).
Live agent escalation via Connect Tasks — the natural follow-up to a transfer. Tasks now ship with AI-powered overviews and recommended next actions, so the human picking up doesn't walk into a cold call. Creating one is a single API call, sketched below.
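On the IAM point, a feel for what "minimal" means in practice: an illustrative inline policy for fn-brain's execution role, with hypothetical bucket, table, role, and account identifiers. Bedrock invocation permissions would sit alongside these statements.

```python
import json
import boto3

FN_BRAIN_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::voice-agent-config/knowledge/*"},
        {"Effect": "Allow",
         "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
         "Resource": "arn:aws:dynamodb:ap-southeast-1:123456789012:"
                     "table/voice-agent-calls"},
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="fn-brain-role",        # hypothetical role name
    PolicyName="fn-brain-minimal",
    PolicyDocument=json.dumps(FN_BRAIN_POLICY),
)
```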
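On the X-Ray point, turning active tracing on for the two hot functions is one configuration call each; a boto3 sketch, using the diagram's labels as function names. The 100%-versus-10% knob lives in the sampling configuration, not in this call.

```python
import boto3

lam = boto3.client("lambda", region_name="ap-southeast-1")

# Active tracing on the two latency-critical functions.
for fn in ("fn-brain", "fn-speaker"):
    lam.update_function_configuration(
        FunctionName=fn,
        TracingConfig={"Mode": "Active"},
    )
```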
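On the CloudFormation point, the handler side is just environment variables; the variable names below are assumptions about how the template wires its parameters through.

```python
import os
import boto3

# A stack parameter update swaps the model or the voice; no code changes.
MODEL_ID = os.environ["BEDROCK_MODEL_ID"]
POLLY_VOICE = os.environ["POLLY_VOICE_ID"]

polly = boto3.client("polly", region_name="ap-southeast-1")

def synthesize(text: str) -> bytes:
    """fn-speaker's core call: text in, telephony-rate PCM audio out."""
    resp = polly.synthesize_speech(
        Text=text,
        VoiceId=POLLY_VOICE,
        OutputFormat="pcm",
        SampleRate="8000",
    )
    return resp["AudioStream"].read()
```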
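On the Tasks point, a transfer branch that opens a task is a single start_task_contact call; every identifier below is a placeholder.

```python
import boto3

connect = boto3.client("connect", region_name="ap-southeast-1")

# The task carries enough context that the human isn't starting cold.
connect.start_task_contact(
    InstanceId="11111111-2222-3333-4444-555555555555",
    ContactFlowId="66666666-7777-8888-9999-000000000000",
    Name="Voice agent handoff: pricing question",
    Description="Caller asked about a price the knowledge file doesn't cover.",
    References={
        "transcript": {
            "Value": "https://example.com/calls/abc123/transcript",
            "Type": "URL",
        },
    },
    Attributes={"source": "fn-brain.transfer_to_human"},
)
```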
If you’re recreating this
Start with Build & Deploy alone (a single Lambda, no triggers). Once git push reliably updates an empty stack, claim a phone number on Connect and get a static greeting playing. Then a contact flow with the time-of-day check. Then the audio bridge into fn-call-orchestrator. Then the Listener + Brain on a single hard-coded tool. Then the Speaker. Then the other three tools. Cross-cutting (audit, logs, alarms, budget, archive) goes in from day one.
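For the static-greeting milestone, the whole Lambda can be this. Connect's Lambda integration expects a flat map of string keys to string values back, which the contact flow then reads as $.External.greeting in a Play prompt block.

```python
# First milestone handler: invoked from the contact flow, no triggers yet.
def lambda_handler(event, context):
    # event["Details"]["ContactData"] carries the caller number, channel,
    # and contact attributes once the flow is wired up.
    return {"greeting": "Thanks for calling. We're open nine to six, "
                        "Monday to Saturday."}
```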