Engineering reference: the chat assistant architecture
Same system as the rest of the series, drawn purely for engineers. Service names, resource identifiers, region, Bedrock model IDs, Knowledge Base wiring, and the actual flow operations — everything you’d need to recreate this in your own AWS account.
Posts 1–6 walk through the system in plain language. This page is the dense version — nothing softened, just the architecture as you’d sketch it on a whiteboard during a design review.
Read this top-down, then column-by-column
The top row is the three external surfaces. Below it, the AWS account contains five subsystems: Build & Deploy across the top, then Knowledge Sync, then three runtime columns (Conversation gateway, Answerer, Handoff & learning), with a Cross-cutting strip at the bottom. A visitor opens the widget, the page calls fn-mint-token for a short-lived JWT, and connects to wss-chat. The $connect route writes a session row; each $default message invokes fn-ws-message, which appends to the session scratchpad and invokes fn-answerer. The answerer issues a Bedrock RetrieveAndGenerate against kb-website-knowledge with strict tool_use over four tools (answer, clarify, hand_off, decline), streams the reply back via PostToConnection, and writes the turn into tbl-transcripts. hand_off invokes fn-handoff; clarify and decline append to tbl-gaps for weekly review.
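As a concrete anchor for that flow, here is a minimal sketch of an fn-ws-message handler, assuming the table and function names above; the message payload shape, the scratchpad trim length, and the async invoke are assumptions, and error handling is omitted.

```python
import json

import boto3

dynamodb = boto3.resource("dynamodb")
lambda_client = boto3.client("lambda")
sessions = dynamodb.Table("tbl-sessions")

SCRATCHPAD_MAX_TURNS = 6  # assumption for "trimmed to the last few turns"


def handler(event, context):
    """$default route: append the visitor turn, then hand the work to fn-answerer."""
    connection_id = event["requestContext"]["connectionId"]
    visitor_text = json.loads(event["body"])["text"]  # assumed message shape

    # Read, append, and trim the session scratchpad in tbl-sessions.
    item = sessions.get_item(Key={"connection_id": connection_id}).get("Item", {})
    scratchpad = item.get("scratchpad", [])
    scratchpad = (scratchpad + [{"role": "visitor", "text": visitor_text}])[-SCRATCHPAD_MAX_TURNS:]
    sessions.update_item(
        Key={"connection_id": connection_id},
        UpdateExpression="SET scratchpad = :s",
        ExpressionAttributeValues={":s": scratchpad},
    )

    # Invoke the answerer asynchronously so the $default route returns immediately.
    lambda_client.invoke(
        FunctionName="fn-answerer",
        InvocationType="Event",
        Payload=json.dumps({"connection_id": connection_id, "scratchpad": scratchpad}),
    )
    return {"statusCode": 200}
```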
Naming conventions used in the diagram
- Lambda functions: fn-<purpose> (fn-mint-token, fn-ws-connect, fn-ws-message, fn-ws-disconnect, fn-answerer, fn-handoff, fn-gaps-batch, fn-archive).
- DynamoDB tables: tbl-sessions (partition key connection_id, with a scratchpad list trimmed to the last few turns and a TTL on idle), tbl-transcripts (partition key session_id, sort key turn_index), tbl-gaps (partition key week_iso, sort key created_at#turn_id, carrying the visitor turn, page URL, and closest-passage scores). Key schemas are sketched in the snippet after this list.
- SNS topics: t-handoffs for human-handoff fan-out (email, optional Slack), t-alarms for general failures.
- S3 layout: a single bucket, chat-assistant-data, with prefixes transcripts/{date}/ and archive/.
- Knowledge Base: kb-website-knowledge, a Bedrock managed Knowledge Base with a Drive connector pointed at the help/policies folder, embeddings model amazon.titan-embed-text-v2:0, vector store on Amazon OpenSearch Serverless (provisioned and managed by Bedrock when you create the KB).
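If it helps to see those key schemas spelled out, the three tables could be created roughly as below; on-demand billing and the TTL attribute name are assumptions, and the tbl-gaps sort key is stored under a plain attribute whose value is the created_at#turn_id concatenation.

```python
import boto3

ddb = boto3.client("dynamodb")

# tbl-sessions: one row per live WebSocket connection, expired on idle via TTL.
ddb.create_table(
    TableName="tbl-sessions",
    AttributeDefinitions=[{"AttributeName": "connection_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "connection_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
ddb.get_waiter("table_exists").wait(TableName="tbl-sessions")
ddb.update_time_to_live(
    TableName="tbl-sessions",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},  # assumed attribute name
)

# tbl-transcripts: every turn of every conversation, ordered by turn_index.
ddb.create_table(
    TableName="tbl-transcripts",
    AttributeDefinitions=[
        {"AttributeName": "session_id", "AttributeType": "S"},
        {"AttributeName": "turn_index", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "session_id", "KeyType": "HASH"},
        {"AttributeName": "turn_index", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)

# tbl-gaps: clarify/decline turns grouped by ISO week for the weekly review.
ddb.create_table(
    TableName="tbl-gaps",
    AttributeDefinitions=[
        {"AttributeName": "week_iso", "AttributeType": "S"},
        {"AttributeName": "created_at_turn_id", "AttributeType": "S"},  # value: "<created_at>#<turn_id>"
    ],
    KeySchema=[
        {"AttributeName": "week_iso", "KeyType": "HASH"},
        {"AttributeName": "created_at_turn_id", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```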
Region, model access, websocket details, and Drive auth
Everything runs in ap-southeast-1 (Singapore). Bedrock model invocations use the Global cross-Region inference profile (global. prefix on model IDs) — data at rest stays in Singapore; inference may route to other regions for capacity, billed at on-demand Singapore rates.
The widget mints its session token from fn-mint-token rather than authenticating directly against API Gateway; the JWT is short-lived (a few minutes) and is verified on the $connect route by a Lambda authorizer before fn-ws-connect runs. This keeps long-lived secrets out of browsers entirely. Streaming replies use ApiGatewayManagementApi.PostToConnection with chunked writes — the answerer flushes partial responses every few tokens so the visitor sees words appear within a second.
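A minimal sketch of that streaming helper, assuming the wss-chat callback URL arrives via an environment variable and that "every few tokens" maps to a small character buffer:

```python
import os

import boto3

# Callback endpoint shape: https://<api-id>.execute-api.ap-southeast-1.amazonaws.com/<stage>
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url=os.environ["WSS_CHAT_CALLBACK_URL"],  # assumed environment variable name
)

FLUSH_EVERY_CHARS = 40  # assumption: roughly "every few tokens"


def stream_reply(connection_id: str, token_iterator) -> None:
    """Push partial model output to the visitor as it arrives."""
    buffer = ""
    for token in token_iterator:
        buffer += token
        if len(buffer) >= FLUSH_EVERY_CHARS:
            apigw.post_to_connection(ConnectionId=connection_id, Data=buffer.encode("utf-8"))
            buffer = ""
    if buffer:
        apigw.post_to_connection(ConnectionId=connection_id, Data=buffer.encode("utf-8"))
```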
Google Drive authentication uses a service account with domain-wide delegation over a single scope: https://www.googleapis.com/auth/drive.readonly on the help-docs folder only. The Bedrock Knowledge Base Drive connector consumes that credential out of AWS Secrets Manager. Editing a doc and saving triggers a re-sync within minutes; manual re-sync is one CLI call.
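That manual re-sync is a single start-ingestion-job call against the KB's Drive data source (aws bedrock-agent start-ingestion-job from the CLI). The same thing from boto3, with placeholder IDs:

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="ap-southeast-1")

response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="KBXXXXXXXXXX",   # placeholder: the ID behind kb-website-knowledge
    dataSourceId="DSXXXXXXXXXX",      # placeholder: the Drive connector's data source ID
    description="Manual re-sync after editing the help docs",
)
print(response["ingestionJob"]["status"])  # STARTING, then IN_PROGRESS / COMPLETE on later polls
```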
The answerer uses strict tool_use: four tool definitions (answer, clarify, hand_off, decline) with required parameter schemas. The answer tool requires a citation_id parameter referencing one of the retrieved passages by id; the runtime validates the citation against the retrieved set before allowing PostToConnection to flush. If the model emits an answer with a citation that wasn’t in the retrieved set, the runtime downgrades to hand_off — the safer-by-default failure mode.
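A sketch of how those four tool definitions and the citation check might look if expressed as a Bedrock Converse toolConfig; everything beyond the four tool names and the required citation_id parameter (field names, descriptions, the forced-choice setting) is an assumption.

```python
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {"type": "string"},
        "citation_id": {"type": "string", "description": "id of one retrieved passage"},
    },
    "required": ["text", "citation_id"],
}


def _simple_tool(name: str, description: str, field: str) -> dict:
    """Helper for the three single-field tools (clarify, hand_off, decline)."""
    return {
        "toolSpec": {
            "name": name,
            "description": description,
            "inputSchema": {"json": {
                "type": "object",
                "properties": {field: {"type": "string"}},
                "required": [field],
            }},
        }
    }


TOOL_CONFIG = {
    "tools": [
        {"toolSpec": {"name": "answer", "description": "Answer using a cited passage",
                      "inputSchema": {"json": ANSWER_SCHEMA}}},
        _simple_tool("clarify", "Ask the visitor a clarifying question", "question"),
        _simple_tool("hand_off", "Escalate the conversation to a human", "reason"),
        _simple_tool("decline", "Politely decline an out-of-scope request", "reason"),
    ],
    "toolChoice": {"any": {}},  # strict tool_use: the model must call exactly one of the four
}


def enforce_citation(tool_name: str, tool_input: dict, retrieved_ids: set) -> tuple:
    """Downgrade an answer whose citation_id is not in the retrieved set to hand_off."""
    if tool_name == "answer" and tool_input.get("citation_id") not in retrieved_ids:
        return "hand_off", {"reason": "citation not in retrieved set"}
    return tool_name, tool_input
```

The enforce_citation step sits between the model's tool call and the PostToConnection flush, which is exactly the safer-by-default downgrade path described above.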
What’s deliberately not on the diagram
- IAM policy details — per-Lambda execution roles are minimal (one bucket prefix, one or two tables, a single Bedrock KB ID, InvokeModel on one model, execute-api:ManageConnections on one API). A rough sketch of the fn-answerer policy follows this list.
- Per-business knowledge layout — a flat Drive folder is fine for the first few months; subdivide by topic (shipping/, returns/, pricing/) once it grows past a couple of dozen docs, so writers know where new paragraphs go.
- X-Ray tracing — on for fn-answerer and fn-handoff, sampling 100% during tuning, 10% in steady state.
- Bedrock Guardrails contextual grounding check — managed grounding-and-relevance scoring. The custom citation-verification step in fn-answerer is roughly the same idea hand-rolled; turning on Guardrails moves the threshold into console configuration and adds PII redaction on every model call. Worth enabling once thresholds are stable.
- Long-lived visitor identity — for logged-in customers who want their order history available, swap connection_id for an authenticated customer_id at $connect and bind the scratchpad to that identity. Keep it opt-in.
- Multi-tenant variant — if running this on behalf of multiple SMBs, namespace the KB and tables per tenant and inject tenant_id into every record. The architecture doesn't change shape; the IDs do.
- Slack two-way handoff — the diagram fans out to Slack as a notification only. A bidirectional Slack-to-visitor reply path (agent types in Slack, visitor sees it in the widget) is an additional Lambda plus a Slack Events API subscription; it stays off the default diagram to keep the per-message cost in the always-free band.
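For the IAM point above, here is a rough shape of the fn-answerer execution policy, expressed as a Python dict. The account ID, KB ID, API ID, and table access are placeholders and assumptions; in practice the InvokeModel resource would be pinned to the single model or inference profile actually in use.

```python
FN_ANSWERER_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Retrieval against the one Knowledge Base.
            "Effect": "Allow",
            "Action": ["bedrock:Retrieve", "bedrock:RetrieveAndGenerate"],
            "Resource": "arn:aws:bedrock:ap-southeast-1:111122223333:knowledge-base/KBXXXXXXXXXX",
        },
        {   # Generation: scope this down to the single model / inference profile used.
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": "arn:aws:bedrock:*::foundation-model/*",
        },
        {   # Streaming back over the one WebSocket API.
            "Effect": "Allow",
            "Action": "execute-api:ManageConnections",
            "Resource": "arn:aws:execute-api:ap-southeast-1:111122223333:a1b2c3d4e5/*/POST/@connections/*",
        },
        {   # The two tables this function actually touches.
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:UpdateItem", "dynamodb:PutItem"],
            "Resource": [
                "arn:aws:dynamodb:ap-southeast-1:111122223333:table/tbl-sessions",
                "arn:aws:dynamodb:ap-southeast-1:111122223333:table/tbl-transcripts",
            ],
        },
    ],
}
```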
If you’re recreating this
Start with Build & Deploy alone (a single Lambda, no triggers). Once git push reliably updates an empty stack, create the Bedrock Knowledge Base with one Drive doc and confirm a one-shot RetrieveAndGenerate call returns a passage. Then the WebSocket API with stub $connect/$default/$disconnect handlers that just echo back. Then the real fn-answerer with strict tool_use and citation verification (this is the part most worth integration-testing — intentionally try to make the model cite a passage outside the retrieved set and confirm the runtime downgrades to hand_off). Then the handoff fan-out and the gaps log. Cross-cutting (audit, logs, alarms, budget, archive) goes in from day one.
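The one-shot RetrieveAndGenerate check in the second step could look like this, with placeholder IDs and a hypothetical inference-profile ARN standing in for the real global. model ID:

```python
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="ap-southeast-1")

response = runtime.retrieve_and_generate(
    input={"text": "What is the returns policy?"},  # any question the one Drive doc can answer
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KBXXXXXXXXXX",  # placeholder for kb-website-knowledge
            "modelArn": "arn:aws:bedrock:ap-southeast-1:111122223333:inference-profile/global.<generation-model-id>",
        },
    },
)

print(response["output"]["text"])
# Confirm at least one passage actually came back from the KB.
for citation in response.get("citations", []):
    for ref in citation.get("retrievedReferences", []):
        print(ref["content"]["text"][:120])
```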