Part 7 of 7 · Voice agent series ~3 min read

Engineering reference: the voice agent architecture

Same system as the rest of the series, drawn purely for engineers. Service names, resource identifiers, region, Bedrock model IDs, and the actual flow operations — everything you’d need to recreate this in your own AWS account.

Posts 1–6 walk through the system in plain language. This page is the dense version — no softening, just the architecture as you’d sketch it on a whiteboard during a design review.

Full technical architecture: serverless voice agent in ap-southeast-1. The diagram shows three external surfaces at the top:

  • GitHub — github.com/owner/repo, Actions runner, OIDC token requestor.
  • Google Drive — config folder holding the knowledge file (hours, FAQs, tone), with changes.watch push notifications.
  • PSTN / caller — the public phone network where the inbound call originates and the bot's voice is sent back.

Everything runs in a single AWS account in ap-southeast-1 (Singapore), split into five subsystems with a cross-cutting strip at the bottom:

  • Build & Deploy — GitHub Actions exchanges a token with the IAM OIDC Provider (token.actions.githubusercontent.com), assumes an IAM Role whose trust policy is scoped to repo:owner/repo:ref:main, and runs SAM/CloudFormation to update the voice-agent-prod stack.
  • Config Sync — a Lambda Function URL, fn-config-sync, receives Drive changes.watch notifications, validates the knowledge file, and writes it to S3 voice-agent-data/config/.
  • Connect (call entry and audio) — Amazon Connect with a claimed DID handles the inbound call and runs a Contact Flow that greets, screens for time-of-day and VIP rules, and routes; Kinesis Video Streams carries both directions of audio; Lambda fn-call-orchestrator (py3.14) manages the streaming session lifecycle.
  • Listener + Brain (per caller turn) — Amazon Transcribe Streaming produces partial and final transcripts; on each final utterance, Lambda fn-brain retrieves top-k knowledge chunks from the S3 Vectors index vec-knowledge and invokes Bedrock global.anthropic.claude-haiku-4-5 with strict tool_use over four tools (answer, book, transfer, end).
  • Speaker (audio back to caller) — Lambda fn-speaker streams the brain's reply text to Amazon Polly Generative voices over the Polly Bidirectional Streaming API (launched March 2026, designed for low-latency LLM-to-TTS pipelines); audio chunks flow back through Kinesis Video Streams into Connect, which plays them to the caller in under a second. Side branches handle Connect transfer for human handoff and Lambda fn-booking, which writes appointments to the operator's calendar via API.
  • Cross-cutting — DynamoDB tables tbl-call-logs and tbl-audit log every call and action; every CloudWatch log group is set to RetentionInDays: 7; SNS topics t-alarms and t-transferred-calls email the operator on failures and transfers; AWS Budgets holds an $80 monthly alarm; Lambda fn-call-archive runs on a weekly EventBridge cron(0 3 ? * SUN *) and moves old call recordings to the S3 Glacier Instant Retrieval storage class.
Fig 7. Full architecture, ap-southeast-1. White boxes = AWS resources; dashed AWS container; dashed grey boxes = subsystem groupings; dashed grey arrows = config feed and side branches.
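One item in the cross-cutting strip worth making concrete: the weekly archive job. A sketch of fn-call-archive, assuming a 30-day cutoff and the `recordings/{date}/` key layout; the helper names and cutoff are illustrative, not the actual implementation.

```python
from datetime import datetime, timedelta, timezone

BUCKET = "voice-agent-data"
ARCHIVE_AFTER_DAYS = 30  # assumption: the diagram doesn't state a cutoff


def is_archivable(key: str, now: datetime, days: int = ARCHIVE_AFTER_DAYS) -> bool:
    """Keys look like recordings/{date}/...; archive anything older than the cutoff."""
    try:
        date_part = key.split("/")[1]  # e.g. "2026-01-15"
        recorded = datetime.strptime(date_part, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    except (IndexError, ValueError):
        return False  # malformed or non-recording key: leave it alone
    return now - recorded > timedelta(days=days)


def handler(event, context):
    """Triggered weekly by EventBridge cron(0 3 ? * SUN *)."""
    import boto3  # imported lazily so the pure helper is testable without AWS deps

    s3 = boto3.client("s3")
    now = datetime.now(timezone.utc)
    for page in s3.get_paginator("list_objects_v2").paginate(
        Bucket=BUCKET, Prefix="recordings/"
    ):
        for obj in page.get("Contents", []):
            if is_archivable(obj["Key"], now):
                # An in-place copy with a new storage class transitions the object.
                s3.copy_object(
                    Bucket=BUCKET,
                    Key=obj["Key"],
                    CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
                    StorageClass="GLACIER_IR",
                    MetadataDirective="COPY",
                )
```

A plain S3 lifecycle transition rule on `recordings/` would do the same with zero code; the Lambda variant matches the diagram's fn-call-archive and leaves room for per-call exceptions.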

Read this top-down, then column-by-column

Top row is the three external surfaces. Below it, the AWS account contains five subsystems: Build & Deploy across the top, then Config Sync, then three runtime columns (Connect, Listener+Brain, Speaker), with a Cross-cutting strip at the bottom. The bidirectional audio arrow runs from the caller (top right) all the way down to Connect (bottom left) — carrying caller speech in and bot voice back out on the same channel. The dashed grey arrow from Config Sync to fn-brain shows the knowledge dependency — the brain reads the latest knowledge file from S3 on every turn.

Naming conventions used in the diagram

  • Lambda functions: fn-<purpose> — fn-call-orchestrator, fn-brain, fn-speaker, fn-booking, fn-config-sync, fn-call-archive.
  • DynamoDB tables: tbl-call-logs, tbl-audit.
  • SNS topics: t-alarms for general failures, t-transferred-calls for human-handoff notifications.
  • S3 layout: single bucket voice-agent-data with prefixes config/, recordings/{date}/, archive/.
  • S3 Vectors index: vec-knowledge — chunked + embedded knowledge file for top-k retrieval.
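One way the knowledge file might be chunked before embedding into vec-knowledge. The diagram only says "chunked + embedded"; the heading-based split and size cap below are assumptions.

```python
def chunk_knowledge(markdown: str, max_chars: int = 1200) -> list[str]:
    """Split knowledge.md on '## ' sections, then cap chunk size.

    Keeping each chunk to one section preserves context (hours stay with
    hours, FAQs with FAQs) so top-k retrieval returns coherent passages.
    """
    sections, current = [], []
    for line in markdown.splitlines():
        if line.startswith("## ") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks = []
    for section in sections:
        if not section:
            continue
        # Oversized sections get split on paragraph boundaries.
        while len(section) > max_chars:
            cut = section.rfind("\n\n", 0, max_chars)
            if cut <= 0:
                cut = max_chars
            chunks.append(section[:cut].strip())
            section = section[cut:].strip()
        if section:
            chunks.append(section)
    return chunks
```

Each chunk is then embedded and written to the vec-knowledge index; fn-brain queries it per turn.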

Region and Bedrock model access

Everything runs in ap-southeast-1 (Singapore) for low latency from the Philippines. Bedrock model invocations use the Global cross-Region inference profile (model IDs prefixed with global.) — data at rest stays in Singapore; inference may route to other regions for capacity. Pricing is the same as on-demand Singapore pricing.
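Using the Global profile is purely a matter of the model ID. A minimal sketch with the boto3 Converse API; the helper names and inference settings are illustrative, and the model ID is the one from the diagram.

```python
def build_request(model_id: str, user_text: str) -> dict:
    """Assemble a Converse API request; kept pure so it can be unit-tested."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.2},
    }


def ask(user_text: str) -> str:
    import boto3  # imported lazily so build_request is testable without AWS deps

    # Client lives in ap-southeast-1; the global. prefix lets Bedrock route the
    # inference cross-Region while billing stays at Singapore on-demand rates.
    client = boto3.client("bedrock-runtime", region_name="ap-southeast-1")
    resp = client.converse(**build_request("global.anthropic.claude-haiku-4-5", user_text))
    return resp["output"]["message"]["content"][0]["text"]
```

Nothing else in the call changes versus a plain regional model ID; the routing is Bedrock's concern, not yours.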

The brain uses strict tool_use: four tool definitions (answer_from_knowledge, book_appointment, transfer_to_human, end_call) with required parameter schemas, so the model can only emit a structured tool call, never a free-text reply. Free text would let the model invent prices or promises; strict tool_use removes that failure mode at the output-format level.
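What "strict tool_use over four tools" looks like in Converse terms. Only the four tool names come from the design; the parameter schemas below are guesses at reasonable shapes.

```python
TOOLS = {
    "toolChoice": {"any": {}},  # the model MUST call one of the tools: no free text
    "tools": [
        {"toolSpec": {
            "name": "answer_from_knowledge",
            "description": "Answer the caller using ONLY the retrieved knowledge chunks.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"reply": {"type": "string"}},
                "required": ["reply"],
            }},
        }},
        {"toolSpec": {
            "name": "book_appointment",
            "description": "Book a slot on the operator's calendar.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"name": {"type": "string"},
                               "phone": {"type": "string"},
                               "slot_iso": {"type": "string"}},
                "required": ["name", "phone", "slot_iso"],
            }},
        }},
        {"toolSpec": {
            "name": "transfer_to_human",
            "description": "Hand the call to a human via Connect transfer.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            }},
        }},
        {"toolSpec": {
            "name": "end_call",
            "description": "Politely end the call.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"farewell": {"type": "string"}},
                "required": ["farewell"],
            }},
        }},
    ],
}
```

Passed as `toolConfig=TOOLS` on the converse call, `toolChoice: {"any": {}}` is what makes it strict: every model turn is forced through one of these four schemas.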

What’s deliberately not on the diagram

  • IAM policy details — per-Lambda execution role inline policies are minimal (one bucket prefix, one table, one Connect instance as appropriate).
  • Per-business knowledge schema — the knowledge.md file is a single Drive doc with sections for hours, services, FAQs, and tone. Updating sections updates the agent’s answers without a deploy.
  • X-Ray tracing — on for fn-brain and fn-speaker, sampling 100% during tuning, 10% in steady state. Latency is the design problem; tracing is non-negotiable here.
  • CloudFormation parameters — the Bedrock model ID and the Polly voice are templated, so swapping voices or models doesn’t require code changes.
  • Connect AI agents — AWS’s built-in path with native ESCALATION and HANDOFF tools and managed Bedrock integration. Less code than the custom KVS+Lambda path here, at the cost of less control over tool schemas and retrieval. Worth picking when the four tools and bring-your-own knowledge file aren’t strictly required.
  • Amazon Nova 2 Sonic — speech-to-speech via a single Bedrock model, GA December 2025. Collapses the Transcribe + Bedrock + Polly three-stage path into one call when it’s available in your region (currently us-east-1, us-west-2, ap-northeast-1; not yet in ap-southeast-1).
  • Live agent escalation via Connect Tasks — the natural follow-up to a transfer. Tasks now ship with AI-powered overviews and recommended next actions, so the human picking up doesn’t walk into a cold call.
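The "validates" step in fn-config-sync is worth sketching: a bad sync must never clobber a good knowledge file. The required section names mirror the knowledge.md sections mentioned above (hours, services, FAQs, tone); the exact checks and size cap are assumptions.

```python
REQUIRED_SECTIONS = ("hours", "services", "faqs", "tone")


def validate_knowledge(markdown: str, max_bytes: int = 256_000) -> list[str]:
    """Return a list of problems; an empty list means safe to publish.

    fn-config-sync only overwrites voice-agent-data/config/knowledge.md
    in S3 when this returns no problems; otherwise it alarms via SNS.
    """
    problems = []
    if not markdown.strip():
        problems.append("file is empty")
    if len(markdown.encode("utf-8")) > max_bytes:
        problems.append(f"file exceeds {max_bytes} bytes")
    headings = {line[3:].strip().lower()
                for line in markdown.splitlines() if line.startswith("## ")}
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing required section: {section}")
    return problems
```

The list-of-problems return (rather than a bool) gives the alarm email something actionable to say.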

If you’re recreating this

Start with Build & Deploy alone (a single Lambda, no triggers). Once git push reliably updates an empty stack, claim a phone number on Connect and get a static greeting playing. Then a contact flow with the time-of-day check. Then the audio bridge into fn-call-orchestrator. Then the Listener + Brain on a single hard-coded tool. Then the Speaker. Then the other three tools. Cross-cutting (audit, logs, alarms, budget, archive) goes in from day one.
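The time-of-day check in that sequence is the first piece of real logic. Connect's built-in hours-of-operation blocks handle the simple case natively; a Lambda behind the contact flow, sketched below, is for when the rules outgrow them (VIP overrides, holidays). The hours are illustrative; the timezone is Philippine time per the series.

```python
from datetime import datetime, time, timedelta, timezone

PH_TZ = timezone(timedelta(hours=8))  # Asia/Manila, no DST
OPEN, CLOSE = time(9, 0), time(18, 0)  # assumed business hours
OPEN_DAYS = range(0, 6)  # Mon-Sat


def is_open(now_utc: datetime) -> bool:
    """True if the business is open at the given UTC instant."""
    local = now_utc.astimezone(PH_TZ)
    return local.weekday() in OPEN_DAYS and OPEN <= local.time() < CLOSE


def handler(event, context):
    """Connect invokes this from the contact flow and branches on the result."""
    return {"open": is_open(datetime.now(timezone.utc))}
```

The contact flow then routes: open hours to the AI session, closed hours to a voicemail-style prompt or straight to end_call behaviour.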

All posts