Part 1 of 7 · Voice agent series

A voice agent on AWS for the price of a phone plan

Your business has a phone. Most of the time it rings outside hours, or while you’re with another customer, or about something simple (“what time do you close?”) that doesn’t really need you. Here’s how to build a small voice agent that picks up, answers from your own knowledge, and politely passes the rest to a human.

The whole system on one page

Before any code, here’s the shape of what we’re building.

[Figure: system architecture. Three outside surfaces sit in a row at the top: the caller (the person dialing your business number), your knowledge (opening hours, FAQ answers, and the tone you want the agent to use), and your team (where calls get transferred when they need a human). Each connects by an arrow to the AWS account below. The caller has a two-way arrow labeled “audio”, covering speech in and the agent’s voice out. Your knowledge guides the Brain, and the Brain transfers complex calls to your team. Inside the AWS account, three components mirror the row above: the Listener (voice into text, in real time), the Brain (decides what to say or what to do), and the Speaker (text back into a natural voice, played to the caller). Arrows run left to right: the Listener passes the Brain what the caller said, and the Brain passes the Speaker what to say. A note at the bottom reads: listening, deciding, and speaking together fit in under a second, so the conversation feels natural.]
Fig 1. Three outside surfaces, three pieces inside AWS. Audio in, audio back, with a brain in the middle.

What you set up once (the outside)

  • A business phone number — a real number callers can dial. Existing numbers can be ported in if you already have one.
  • A short knowledge file — your opening hours, your most common questions and answers, and the tone you want the agent to use. Lives in a Google Doc you can edit anytime.
  • A way to reach a human — a number or queue the agent transfers to when a call needs you. Your mobile, your shop’s landline, a queue at your service desk — whichever fits.

What runs on every call (the inside)

  • The listener — turns the caller’s voice into text as they speak. Locks in the final version when they pause.
  • The brain — reads the caller’s words and decides one of four things: answer from the knowledge file, book the appointment, transfer to a human, or end the call gracefully.
  • The speaker — turns the brain’s reply back into a natural voice and plays it to the caller in real time.
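
To make the brain's four outcomes concrete, here is a minimal sketch of the decision step. It is an illustration, not the series' actual implementation: the keyword matching stands in for whatever model the brain really uses, and the names `Action`, `Decision`, `decide`, and the sample `KNOWLEDGE` entries are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"      # reply from the knowledge file
    BOOK = "book"          # start booking an appointment
    TRANSFER = "transfer"  # hand the call to a human
    END = "end"            # close the call gracefully


@dataclass
class Decision:
    action: Action
    reply: str


# Hypothetical knowledge file contents, keyed by a trigger word.
KNOWLEDGE = {
    "close": "We close at 6 pm on weekdays.",
    "open": "We open at 9 am on weekdays.",
}


def decide(transcript: str, knowledge: dict[str, str] = KNOWLEDGE) -> Decision:
    """Pick one of the four outcomes for a finalized caller utterance."""
    text = transcript.lower()
    if any(phrase in text for phrase in ("goodbye", "bye", "that's all")):
        return Decision(Action.END, "Thanks for calling. Goodbye!")
    if "appointment" in text or "book" in text:
        return Decision(Action.BOOK, "Sure, what day works for you?")
    for keyword, answer in knowledge.items():
        if keyword in text:
            return Decision(Action.ANSWER, answer)
    # Nothing in the knowledge file matched: transfer rather than guess.
    return Decision(Action.TRANSFER, "Let me put you through to someone who can help.")
```

The last line is the important one: when nothing matches, the default is a transfer, never an improvised answer.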

In plain words

Someone calls. The cloud picks up. A small AI listens, decides, and replies in a natural voice, or hands the call to you. The whole loop takes under a second, so the caller doesn’t feel like they’re talking to a phone tree.

Total cost runs in phone-bill territory — a flat fee for the number, then a few cents per minute the line is in use.
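
As a rough sketch of that arithmetic: a flat fee plus a per-minute rate times the minutes used. The function name and the figures below are placeholders chosen for illustration, not real AWS prices.

```python
def monthly_cost(number_fee: float, per_minute_rate: float, minutes_used: float) -> float:
    """Flat fee for the number plus a per-minute charge for time on the line."""
    if minutes_used < 0:
        raise ValueError("minutes_used cannot be negative")
    return number_fee + per_minute_rate * minutes_used


# Placeholder figures for illustration only, not real AWS prices.
estimate = monthly_cost(number_fee=1.00, per_minute_rate=0.03, minutes_used=200)
print(f"Estimated monthly cost: ${estimate:.2f}")
```

Plug in your own number fee, per-minute rate, and expected call volume to get a ballpark figure.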

Design rules that shaped every decision

  • Stay inside the AWS always-free quotas where possible. Voice has unavoidable per-minute costs, but the rest of the system stays free.
  • The agent answers from your knowledge file only — never invents prices, hours, or promises.
  • If the caller asks something the agent isn’t sure about, it transfers. It never bluffs.
  • The conversation has to feel real-time. If the agent can’t reply in under a second, it stalls naturally (“let me check that for you”) instead of going silent.
  • Configuration lives in a Drive doc you can edit. Updating tone or hours never needs a deploy.
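
The real-time rule can be sketched as a timeout around the brain. This is one possible approach using Python's `asyncio`, not necessarily how the series implements it. The names `reply_with_stall`, `think`, and `speak` are hypothetical: `think` stands for an async call that produces the reply, and `speak` for an async call that plays text to the caller.

```python
import asyncio

REPLY_BUDGET_SECONDS = 1.0  # the "under a second" target from the design rules
FILLER = "Let me check that for you."


async def reply_with_stall(think, speak, budget=REPLY_BUDGET_SECONDS):
    """Speak the reply, or a filler line first if the reply is slow.

    think: hypothetical async callable returning the reply text.
    speak: hypothetical async callable that plays text to the caller.
    """
    task = asyncio.ensure_future(think())
    try:
        # shield() keeps the thinking task running if the timeout fires.
        reply = await asyncio.wait_for(asyncio.shield(task), timeout=budget)
    except asyncio.TimeoutError:
        await speak(FILLER)  # stall naturally instead of going silent
        reply = await task   # then wait for the real answer
    await speak(reply)
```

The shield matters here: without it, the timeout would cancel the brain's work and the caller would hear the filler but never the answer.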

Why this shape

Most “voice AI” tools collapse under one of three weights: a server bill that climbs every month, replies that confidently invent prices and hours, or a robot voice that makes callers hang up.

The architecture above is the smallest set of moving parts I could find that solves all three at once. One way in (your phone number), one way out (your team), three small pieces in the middle that listen, decide, and speak fast enough to feel natural. Everything else is plumbing.

The next five posts walk through each piece in turn: how a call connects, how the listener hears in real time, how the brain decides what to say, how the speaker stays natural, and what the whole thing actually costs. One diagram per post. The final post is an engineering reference: the dense version, with precise service names and model IDs.
