Series · 7 parts · Published May 3, 2026

Website chat assistant

A serverless chat widget that lives on your website, answers visitors from your own knowledge in real time, hands the rest to a human cleanly, and turns every miss into a better answer next week. Seven posts on the same system — one diagram at a time — with an engineering reference at the end.

  1. 01

    A website chat assistant on AWS for a few dollars a month

    The whole system on one page — a gateway, an answerer, and a handoff-and-learning piece, plus the four moves they share for every visitor turn.

  2. 02

    How a conversation starts and stays alive

    Connect, exchange, idle. A websocket and a small scratchpad — that’s the whole shape. Short-term memory only.

  3. 03

    How the assistant answers

    Four tools, one pick per turn: answer, clarify, hand off, decline. Without a citation, there is no auto-answer.

  4. 04

    How a handoff to a human works

    Tell, package, deliver, hold. The visitor never repeats themselves to the human; the transcript is the ticket.

  5. 05

    How gaps become better answers

    A small clockwise loop — log every miss, group similar questions, write a paragraph, re-index automatically. Ten minutes a week makes month two’s assistant meaningfully better than month one’s.

  6. 06

    What the chat assistant costs

    A coffee a month at SMB volume. Cents per conversation, dominated by Bedrock tokens for the answerer.

  7. 07

    Engineering reference: the chat assistant architecture

    Same system, drawn purely for engineers. Service names, resource identifiers, region, Bedrock model IDs, Knowledge Base wiring.

What does the website chat assistant do?
It is a small embedded widget that lives on your website. It answers visitor questions from your own help docs, FAQ, and policies in real time over a websocket, hands the rest to a human with the full transcript attached, and quietly logs every question it couldn’t confidently answer so the assistant gets sharper every week.
How much does it cost?
About $3/month at typical small-business volume (a few hundred conversations a month). The fixed cost rounds to zero — quiet weeks bill nothing. The variable cost is cents per conversation, dominated by Bedrock Claude Haiku 4.5 tokens for the answerer. A typical SMB at 500 conversations a month lands well under five dollars total. A $10 monthly AWS Budgets alarm catches anything strange.
Which AWS services does it use?
Lambda (with Function URLs for token mint), API Gateway WebSocket API, DynamoDB on-demand, S3, EventBridge, SNS, SES outbound, Secrets Manager, CloudWatch Logs with seven-day retention, AWS Budgets, and Bedrock (Claude Haiku 4.5 via Global cross-Region inference, plus Titan Text Embeddings v2 with a Bedrock Knowledge Base backed by Amazon S3 Vectors). No NAT Gateway, no always-on compute.
How does it avoid making things up?
Search runs first, generation second. Every visitor turn queries the managed Knowledge Base before the model writes; only retrieved passages are in scope. The answerer uses strict tool_use over four tools (answer, clarify, hand_off, decline) and the answer tool requires a citation_id pointing at a retrieved passage. The runtime verifies the citation against the retrieved set before flushing — if the model cites a passage that wasn’t retrieved, the system downgrades to hand_off as the safer-by-default failure mode.
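The citation check described above can be sketched roughly as follows. This is a minimal illustration, not the series' actual runtime: the `ToolCall` shape, the `verify_tool_call` name, and the reason strings are assumptions; only the four tool names and the `citation_id` field come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Set

# The four tools named in the text.
TOOLS = {"answer", "clarify", "hand_off", "decline"}


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical shape for the model's single tool pick on a turn."""
    name: str
    citation_id: Optional[str] = None
    text: str = ""


def verify_tool_call(call: ToolCall, retrieved_ids: Set[str]) -> ToolCall:
    """Return the call unchanged if it is safe to flush, else downgrade to hand_off."""
    if call.name not in TOOLS:
        # Assumption: an unknown tool is treated like any other unsafe output.
        return ToolCall(name="hand_off", text="unknown tool")
    if call.name == "answer" and call.citation_id not in retrieved_ids:
        # A missing citation (None) or one outside the retrieved set both fail here.
        return ToolCall(name="hand_off", text="citation not in retrieved set")
    return call
```

The check runs after the model picks a tool and before anything is sent to the visitor, so an uncited or mis-cited answer never reaches the websocket.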
How does it hand off to a human?
Four steps in fixed order: tell the visitor with a realistic window and human language, package the transcript (full conversation, one-line AI-written summary, page URL, contact, reason for handoff), deliver to one destination (inbox, Slack via Amazon Q Developer, or shared queue — never two), and hold the websocket open for a couple more turns to catch follow-ups. The transcript is the ticket; the visitor never repeats themselves to the human.
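One way to picture the "package" and "deliver to one destination" steps is a single record with exactly one destination field. The sketch below is illustrative only: the class names, field names, and JSON layout are assumptions, chosen to mirror the five items listed in the text.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Dict, List


class Destination(Enum):
    """The three destinations named in the text; a package carries exactly one."""
    INBOX = "inbox"
    SLACK = "slack"
    SHARED_QUEUE = "shared_queue"


@dataclass
class HandoffPackage:
    """Hypothetical ticket payload: the transcript plus the context a human needs."""
    transcript: List[Dict[str, str]]  # full conversation, oldest turn first
    summary: str                      # one-line AI-written summary
    page_url: str
    contact: str
    reason: str                       # why the assistant handed off
    destination: Destination

    def __post_init__(self) -> None:
        # Enforce the "one-line" part of the summary.
        if "\n" in self.summary:
            raise ValueError("summary must be a single line")

    def to_json(self) -> str:
        body = asdict(self)
        body["destination"] = self.destination.value
        return json.dumps(body)
```

Because `destination` is a single enum value rather than a list, the "never two" rule is enforced by the data shape instead of by convention.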
Does it remember return visitors?
By default, no. Memory is short-term only and bound to one websocket session — the scratchpad holds the last few turns so follow-up questions resolve correctly, then expires when the session idles out. Long-term memory is opt-in: for signed-in customers who want their order history available, swap connection_id for an authenticated customer_id at $connect and bind the scratchpad to that identity.
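The scratchpad behaviour can be sketched as a bounded list of turns with an idle expiry. Everything concrete here is an assumption for illustration: the turn limit, the idle timeout, the key prefixes, and the in-memory class itself (a real deployment would presumably keep this in the DynamoDB table, where `expires_at` could serve as a TTL attribute).

```python
import time
from collections import deque
from typing import Optional

MAX_TURNS = 6           # hypothetical: "the last few turns"
IDLE_TTL_SECONDS = 900  # hypothetical idle timeout


def scratchpad_key(connection_id: str, customer_id: Optional[str] = None) -> str:
    """Default to the websocket session; use the customer identity only when opted in."""
    return f"customer#{customer_id}" if customer_id else f"conn#{connection_id}"


class Scratchpad:
    """In-memory stand-in for the short-term memory described above."""

    def __init__(self, key: str, now: Optional[float] = None) -> None:
        self.key = key
        self.turns = deque(maxlen=MAX_TURNS)  # older turns fall off automatically
        start = time.time() if now is None else now
        self.expires_at = start + IDLE_TTL_SECONDS

    def add_turn(self, role: str, text: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.turns.append({"role": role, "text": text})
        self.expires_at = now + IDLE_TTL_SECONDS  # each turn resets the idle clock

    def is_expired(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return now >= self.expires_at
```

Switching from session memory to opt-in long-term memory changes only the key passed to `Scratchpad`; the rest of the logic stays the same.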
How fast does it respond?
The first words appear within a second. The answerer streams replies over the websocket via ApiGatewayManagementApi.PostToConnection, flushing partial responses every few tokens. There’s no spinner-and-wait: visitors watch the reply appear as if a person were typing it. WebSocket messages run about $1 per million, so the streaming itself is essentially free.
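The "flush every few tokens" idea can be sketched as a small batching loop. This is an assumption-laden illustration: the `stream_reply` name, the batch size, and the `delta`/`done` message shapes are hypothetical. The `post` callable is injected so the sketch runs without AWS; in a Lambda it could wrap boto3's `apigatewaymanagementapi` client and its `post_to_connection(ConnectionId=..., Data=...)` call.

```python
import json
from typing import Callable, Iterable, List

FLUSH_EVERY = 4  # hypothetical batch size for "every few tokens"


def stream_reply(
    tokens: Iterable[str],
    connection_id: str,
    post: Callable[[str, bytes], None],
    flush_every: int = FLUSH_EVERY,
) -> int:
    """Send tokens in small batches; return the number of messages posted."""
    buffer: List[str] = []
    sent = 0

    def flush() -> None:
        nonlocal sent
        payload = {"type": "delta", "text": "".join(buffer)}
        post(connection_id, json.dumps(payload).encode("utf-8"))
        buffer.clear()
        sent += 1

    for token in tokens:
        buffer.append(token)
        if len(buffer) >= flush_every:
            flush()
    if buffer:
        flush()  # send whatever is left after the model stops

    # Hypothetical end-of-reply marker so the widget can stop its typing indicator.
    post(connection_id, json.dumps({"type": "done"}).encode("utf-8"))
    return sent + 1
```

A smaller batch size makes the reply feel more live at the cost of more messages. At roughly $1 per million messages, even a reply split into a few dozen messages costs a tiny fraction of a cent.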