A voice agent on AWS for the price of a phone plan

The whole system on one page

Before any code, here’s the shape of what we’re building.

Fig 1. Three outside surfaces, three pieces inside AWS. Audio in, audio back, with a brain in the middle.

What you set up once (the outside)

A business phone number — a real number callers can dial. Existing numbers can be ported in if you already have one.
A short knowledge file — your opening hours, your most common questions and answers, and the tone you want the agent to use. Lives in a Google Doc you can edit anytime.
A way to reach a human — a number or queue the agent transfers to when a call needs you. Your mobile, your shop’s landline, a queue at your service desk — whichever fits.

What runs on every call (the inside)

The listener — turns the caller’s voice into text as they speak. Locks in the final version when they pause.
The brain — reads the caller’s words and decides one of four things: answer from the knowledge file, book the appointment, transfer to a human, or end the call gracefully.
The speaker — turns the brain’s reply back into a natural voice and plays it to the caller in real time.
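The brain's four-way decision can be sketched as a small routing function. This is only an illustration under loud assumptions, not the implementation used in this series: the `Action` names, the `KNOWLEDGE` dictionary, the sample answers, and the keyword matching are all hypothetical stand-ins for whatever the real model does. The point it shows is the fallback: anything the knowledge file does not cover ends in a transfer, never a guess.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The four outcomes the brain can choose between."""
    ANSWER = "answer"
    BOOK = "book"
    TRANSFER = "transfer"
    END = "end"


@dataclass
class Decision:
    action: Action
    reply: str


# Hypothetical stand-in for the knowledge file; these entries are made up.
KNOWLEDGE = {
    "hours": "We are open nine to five, Monday to Friday.",
    "parking": "There is free parking behind the building.",
}


def decide(utterance: str) -> Decision:
    """Pick one of the four actions using naive keyword matching."""
    text = utterance.lower()
    if "bye" in text or "that's all" in text:
        return Decision(Action.END, "Thanks for calling. Goodbye!")
    if "book" in text or "appointment" in text:
        return Decision(Action.BOOK, "Sure, what day works for you?")
    for topic, answer in KNOWLEDGE.items():
        if topic in text:
            # Reply only with text that exists in the knowledge file.
            return Decision(Action.ANSWER, answer)
    # Nothing matched: hand over to a human instead of bluffing.
    return Decision(Action.TRANSFER, "Let me put you through to someone who can help.")
```

A question about prices, which this toy knowledge file does not cover, falls through to `Action.TRANSFER`.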

In plain words

Someone calls. The cloud picks up. A small AI listens, decides, and replies in a natural voice, or hands the call to you. The whole loop is designed to finish in under a second, so the caller doesn’t feel like they’re talking to a phone tree.

Total cost runs in phone-bill territory — a flat fee for the number, then a few cents per minute the line is in use.

Design rules that shaped every decision

Stay inside the AWS always-free quotas where possible. Voice has unavoidable per-minute costs, but everything else is designed to fit within the free quotas.
The agent answers from your knowledge file only — never invents prices, hours, or promises.
If the caller asks something the agent isn’t sure about, it transfers. It never bluffs.
The conversation has to feel real-time. If the agent can’t reply in under a second, it stalls naturally (“let me check that for you”) instead of going silent.
Configuration lives in a Drive doc you can edit. Updating tone or hours never needs a deploy.
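The real-time rule above (stall naturally instead of going silent) can be sketched with a timeout around the brain's reply. This is a minimal sketch under assumptions, not the series' code: `think` and `speak` are hypothetical async callables standing in for the brain and the speaker, and the one-second budget is passed as a parameter rather than being a measured figure.

```python
import asyncio


async def reply_or_stall(think, speak, budget_s: float = 1.0,
                         filler: str = "Let me check that for you."):
    """Speak the brain's reply; if it misses the budget, speak a filler first.

    `think` is a hypothetical async callable that returns the reply text.
    `speak` is a hypothetical async callable that plays text to the caller.
    """
    task = asyncio.ensure_future(think())
    # Wait up to the latency budget; a timeout does not cancel the task.
    done, _pending = await asyncio.wait({task}, timeout=budget_s)
    if not done:
        # The reply is late: fill the silence while the brain keeps working.
        await speak(filler)
    reply = await task
    await speak(reply)
    return reply
```

The filler buys time without dropping the answer: the original task keeps running in the background and its reply is spoken as soon as it arrives.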

Why this shape

Most “voice AI” tools collapse under one of three weights: a server bill that climbs every month, replies that confidently invent prices and hours, or a robot voice that makes callers hang up.

The architecture above is the smallest set of moving parts I could find that solves all three at once. One way in (your phone number), one way out (your team), three small pieces in the middle that listen, decide, and speak fast enough to feel natural. Everything else is plumbing.

The next five posts walk through each piece in turn: how a call connects, how the listener hears in real time, how the brain decides what to say, how the speaker stays natural, and what the whole thing actually costs. One diagram per post. An engineering reference at the end of the series gives the dense version, with precise service names and model IDs.
