Website chat assistant
A serverless chat widget that lives on your website, answers visitors from your own knowledge in real time, hands the rest to a human cleanly, and turns every miss into a better answer next week. Seven posts on the same system — one diagram at a time — with an engineering reference at the end.
-
01
A website chat assistant on AWS for a few dollars a month
The whole system on one page — a gateway, an answerer, and a handoff-and-learning piece, plus the four moves they share for every visitor turn.
-
02
How a conversation starts and stays alive
Connect, exchange, idle. A websocket and a small scratchpad — that’s the whole shape. Short-term memory only.
-
03
How the assistant answers
Four tools, one pick per turn: answer, clarify, hand off, decline. No citation, no auto-answer.
-
04
How a handoff to a human works
Tell, package, deliver, hold. The visitor never repeats themselves to the human; the transcript is the ticket.
-
05
How gaps become better answers
A small clockwise loop — log every miss, group similar questions, write a paragraph, re-index automatically. Ten minutes a week makes month two’s assistant meaningfully better than month one’s.
-
06
What the chat assistant costs
A coffee a month at SMB volume. Cents per conversation, dominated by Bedrock tokens for the answerer.
-
07
Engineering reference: the chat assistant architecture
Same system, drawn purely for engineers. Service names, resource identifiers, region, Bedrock model IDs, Knowledge Base wiring.
Frequently asked questions
- What does the website chat assistant do?
- A small embedded widget that lives on your website, answers visitor questions from your own help docs, FAQ, and policies in real time over a websocket, hands the rest to a human with the full transcript attached, and quietly logs every question it couldn’t confidently answer so the assistant gets sharper every week.
- How much does it cost?
- About $3/month at typical small-business volume (a few hundred conversations a month). The fixed cost rounds to zero — quiet weeks bill nothing. The variable cost is cents per conversation, dominated by Bedrock Claude Haiku 4.5 tokens for the answerer. A typical SMB at 500 conversations a month lands well under five dollars total. A $10 monthly AWS Budgets alarm catches anything strange.
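The cost claim above is plain arithmetic over token counts, which a short sketch makes concrete. Everything numeric here is an assumption for illustration: the per-million-token prices, the turns per conversation, and the token counts per turn are not taken from the posts, and `Pricing` and `monthly_cost` are hypothetical names. Check the current Bedrock price list before relying on any of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Assumed USD prices per million tokens (illustrative, not authoritative).
    input_per_mtok: float
    output_per_mtok: float


def monthly_cost(conversations, turns_per_conversation,
                 input_tokens_per_turn, output_tokens_per_turn, pricing):
    """Estimate the monthly model-token spend in USD."""
    turns = conversations * turns_per_conversation
    input_cost = turns * input_tokens_per_turn * pricing.input_per_mtok / 1_000_000
    output_cost = turns * output_tokens_per_turn * pricing.output_per_mtok / 1_000_000
    return input_cost + output_cost


# Assumed workload: 500 conversations, 3 turns each, 1,500 input tokens
# (prompt plus retrieved passages) and 150 output tokens per turn.
assumed = Pricing(input_per_mtok=1.0, output_per_mtok=5.0)
estimate = monthly_cost(500, 3, 1500, 150, assumed)
print(f"${estimate:.2f} per month")
```

Because the fixed cost rounds to zero, the estimate scales linearly with conversation count: a month with no conversations costs nothing under this model.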
- Which AWS services does it use?
- Lambda (with Function URLs for token mint), API Gateway WebSocket API, DynamoDB on-demand, S3, EventBridge, SNS, SES outbound, Secrets Manager, CloudWatch Logs with seven-day retention, AWS Budgets, and Bedrock (Claude Haiku 4.5 via Global cross-Region inference, plus Titan Text Embeddings v2 with a Bedrock Knowledge Base backed by Amazon S3 Vectors). No NAT Gateway, no always-on compute.
- How does it avoid making things up?
- Search runs first, generation second. Every visitor turn queries the managed Knowledge Base before the model writes; only retrieved passages are in scope. The answerer uses strict tool_use over four tools (`answer`, `clarify`, `hand_off`, `decline`), and the `answer` tool requires a `citation_id` pointing at a retrieved passage. The runtime verifies the citation against the retrieved set before flushing: if the model cites a passage that wasn’t retrieved, the system downgrades to `hand_off` as the safer-by-default failure mode.
- How does it hand off to a human?
- Four steps in fixed order: tell the visitor with a realistic window and human language, package the transcript (full conversation, one-line AI-written summary, page URL, contact, reason for handoff), deliver to one destination (inbox, Slack via Amazon Q Developer, or shared queue — never two), and hold the websocket open for a couple more turns to catch follow-ups. The transcript is the ticket; the visitor never repeats themselves to the human.
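The citation check from the previous answer and the handoff package from this one fit together naturally, since a failed check ends in a handoff. Below is a minimal sketch of both, assuming the model's tool call arrives as a name plus an arguments dict. `ToolCall`, `enforce_citation`, `package_handoff`, and the passage IDs are hypothetical names for illustration, not the system's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str                                  # "answer", "clarify", "hand_off", or "decline"
    args: dict = field(default_factory=dict)


def enforce_citation(call, retrieved_ids):
    """Pass the call through unless it is an answer citing an unretrieved passage."""
    if call.name != "answer":
        return call
    if call.args.get("citation_id") in retrieved_ids:
        return call
    # A missing or unknown citation downgrades to the safer default.
    return ToolCall("hand_off", {"reason": "citation_not_in_retrieved_set"})


def package_handoff(transcript, summary, page_url, contact, reason):
    """Bundle the fields listed above so the human never has to ask again."""
    return {
        "transcript": list(transcript),
        "summary": summary,
        "page_url": page_url,
        "contact": contact,
        "reason": reason,
    }


retrieved = {"kb-12", "kb-40"}  # IDs of passages returned by search this turn
good = enforce_citation(ToolCall("answer", {"text": "Returns are accepted within 30 days.",
                                            "citation_id": "kb-12"}), retrieved)
bad = enforce_citation(ToolCall("answer", {"text": "Unsupported claim.",
                                           "citation_id": "kb-99"}), retrieved)
ticket = package_handoff(["visitor: Can I return this?"], "Visitor asks about returns.",
                         "https://example.com/returns", "visitor@example.com",
                         bad.args["reason"])
```

The check sits in the runtime rather than in the prompt, so it holds even when the model ignores its instructions.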
- Does it remember return visitors?
- By default, no. Memory is short-term only and bound to one websocket session: the scratchpad holds the last few turns so follow-up questions resolve correctly, then expires when the session idles out. Long-term memory is opt-in. For signed-in customers who want their order history available, swap `connection_id` for an authenticated `customer_id` at `$connect` and bind the scratchpad to that identity.
- How fast does it respond?
- Words appear within a second. The answerer streams replies over the websocket via `ApiGatewayManagementApi.PostToConnection`, flushing partial responses every few tokens. There’s no spinner-and-wait; visitors watch the reply appear as if a person were typing it. WebSocket messages run about $1 per million, so the streaming itself is essentially free.
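The "flush every few tokens" behaviour can be sketched as a small batching loop. This is an illustration under assumptions: `stream_reply`, `send`, and `flush_every` are hypothetical names, and the sender is injected so the sketch runs without AWS credentials. In a real Lambda handler, `send` would presumably wrap boto3's `post_to_connection`, as the comment shows.

```python
from typing import Callable, Iterable


def stream_reply(tokens: Iterable[str], send: Callable[[bytes], None],
                 flush_every: int = 4) -> int:
    """Forward tokens to `send` in batches of `flush_every`; return frames sent."""
    buffer = []
    frames = 0
    for token in tokens:
        buffer.append(token)
        if len(buffer) >= flush_every:
            send("".join(buffer).encode("utf-8"))
            frames += 1
            buffer.clear()
    if buffer:
        # Flush whatever is left once the model stops generating.
        send("".join(buffer).encode("utf-8"))
        frames += 1
    return frames


# Possible wiring in a handler (assumes boto3 and a known callback URL):
#   client = boto3.client("apigatewaymanagementapi", endpoint_url=callback_url)
#   send = lambda data: client.post_to_connection(ConnectionId=connection_id, Data=data)

sent = []  # stand-in for the websocket: collects each frame as bytes
frame_count = stream_reply(["Our ", "return ", "window ", "is ", "30 ", "days."],
                           sent.append, flush_every=4)
```

Batching trades a little latency for fewer messages. Since each frame is a billed WebSocket message, `flush_every` is the knob that balances the two.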