Engineering reference: the ticket router architecture
Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, the SQS queue config, the DynamoDB schemas, and the Slack interactive flow. Read alongside the previous six posts; this one’s the build sheet.
Region and account shape
Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock cross-Region inference, SQS, and Lambda Function URLs are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup work at SMB volume — the failure mode for an SMB is a ticket sitting in the wrong queue for an hour, not a regional outage. One AWS account dedicated to the router (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system.
Topology
Lambda functions
All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.
intake-mail— S3 PUT trigger ons3://tr-raw-mail/. Parses the MIME, extracts sender/subject/body, strips quoted reply history, matches thread headers against open tickets intr-ticketsto merge replies, writes a new ticket (or appends to an existing one), and sends the ticket id to thetr-intakeSQS queue. Memory: 256 MB. Timeout: 30 s.intake-form— Lambda Function URL,AuthType: NONE, verifies a shared secret (in Secrets Manager undertr/form/secret) on the POST body. Builds a ticket from the form fields in the same shape as the mail lane, writes totr-tickets, enqueues the id. Memory: 256 MB. Timeout: 15 s.intake-chat— Lambda Function URL, signature-verified against the chat tool’s secret. Folds a finished conversation into one ticket body, writes totr-tickets, enqueues the id. Memory: 256 MB. Timeout: 15 s.read— SQS event source ontr-intake(batch size 5, partial-batch responses enabled). For each ticket, calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0viaglobal.anthropic.claude-haiku-4-5-20251001-v1:0) with the ticket body, the topic list fromrules.csv, and the labelled examples fromexamples.jsonl; parses the returned JSON (topic,urgency,tone); writes the tags back totr-tickets. On a malformed model response, retries once with a stricter prompt, then tagstopic: unsureso the router holds it. Memory: 512 MB. Timeout: 30 s. This is the only Bedrock callsite.router— invoked byreadafter tagging (or as a second SQS stage). Readss3://tr-rules-source/rules.csv(topic-to-team map) andvoice.txt(urgency words, VIP list, priority rules). Applies the decision flow from Part 3, picks one ofroute,priority,escalate,hold, and writes a row totr-routes. No Bedrock calls. Memory: 256 MB. Timeout: 15 s.dispatch— triggered on newtr-routesrows (DynamoDB Streams). Resolves the team, sets the queue position, runs the duplicate check against recent open tickets, formats the hand-off card, and ships via Slackchat.postMessage(tr/slack/bot-tokenin Secrets Manager) or SESSendRawEmailto the team’s shared inbox. Writes the dispatch outcome back totr-routes. Memory: 256 MB. Timeout: 30 s.correct-handler— Lambda Function URL, public withAuthType: NONE; verifies a Slack signature on the request body. Triggered by Slack interactive button clicks (Reassign/Bump/Split) and by email-link clicks. Updatestr-ticketsandtr-routes; writes totr-correctionsandtr-audit; on reassign or bump, refreshesexamples.jsonlins3://tr-rules-source/with the corrected label (capped to the most recent N examples per topic). On split, creates two new tickets and re-enqueues both. Memory: 256 MB. Timeout: 15 s.drive-sync— EventBridge Scheduler target, fires every 15 minutes. Uses the Google Sheets API + Docs API (service-account credentials in Secrets Manager undertr/drive/sa) to export the rules sheet and the rules doc, writingrules.csvandvoice.txttos3://tr-rules-source/only if changed since the last sync. Memory: 256 MB. Timeout: 30 s.digest— EventBridge Scheduler target, weekly Sunday 6pm. Readstr-routesandtr-correctionsfor the past week; posts a summary to a configured Slack channel: volume by topic and team, the correction rate, and the slowest queues. No Bedrock; a plain summary table. Memory: 256 MB.summary— EventBridge Scheduler target, monthly on the first Monday at 9am. Reads the past month’str-routes,tr-corrections, andtr-audit; calls Bedrock Haiku 4.5 to write a one-paragraph narrative (busiest topics, slowest queues, what the corrections taught); emails it via SES to the configured stakeholder list. Memory: 512 MB.
Storage
- DynamoDB ·
tr-tickets— one row per ticket. PKticket_id; attributes:customer,source(mail/form/chat),subject,body,topic,urgency,tone,status,received_at. GSI on(customer, topic)for the duplicate check. On-demand. - DynamoDB ·
tr-routes— one row per routing decision. PKticket_id; attributes:topic,urgency,tone,team,move(route/priority/escalate/hold),queue_pos,dispatched_via,decided_at. DynamoDB Streams enabled to triggerdispatch. On-demand. - DynamoDB ·
tr-corrections— one row per human correction. PK(ticket_id, ts); attributes:action(reassign/bump/split),orig_topic,orig_team,new_topic,new_team,by_user. This table feeds the labelled-example refresh and the correction-rate metric. On-demand. - DynamoDB ·
tr-audit— one row per write action of any kind. PK(ticket_id, ts); attributes:action,by_user,before,after. On-demand. No TTL — this is the long-term audit trail. - S3 ·
tr-raw-mail— raw inbound MIME from the email lane. Lifecycle to Glacier at 30 days; expiry at 1 year. - S3 ·
tr-rules-source— mirroredrules.csv,voice.txt, andexamples.jsonl. Versioning enabled so a bad rules edit or example flood can be rolled back in one click.
SQS
tr-intake— standard queue between the three intake lanes and thereadLambda. Visibility timeout 60 s (6× the read function timeout). Absorbs bursts so a spike of tickets never overruns Bedrock’s rate limit.tr-intake-dlq— dead-letter queue,maxReceiveCount3. A ticket that fails the read three times (malformed body, model error) lands here and pages the on-call admin instead of silently disappearing. Redrive back totr-intakeonce the cause is fixed.
Bedrock
- Foundation model.
anthropic.claude-haiku-4-5-20251001-v1:0via the Global cross-Region inference profileglobal.anthropic.claude-haiku-4-5-20251001-v1:0. Two callsites:readfor the per-ticket topic/urgency/tone classification, andsummaryfor the monthly narrative. Sonnet 4.6 is not used — classification is well within Haiku’s reach, and a heavier model on the hot path would multiply the dominant cost for no gain. - Embeddings. Not used. Routing is a fresh read plus a sheet lookup; deterministic mapping beats vector retrieval here. No Knowledge Base, no S3 Vectors.
- Quotas. Default account quotas cover SMB volume comfortably. SQS in front of
readsmooths bursts so the per-ticket calls stay under the on-demand throughput limit.
SES inbound and outbound
- Set the MX record on a dedicated subdomain (e.g.
support.your-company.com) toinbound-smtp.ap-southeast-1.amazonaws.com. - SES inbound rule set
tr-inbound-rules: one rule with recipientsupport@your-company.com→ spam scan → S3 PUT tos3://tr-raw-mail/<message-id>→ stop. The S3 PUT triggersintake-mail. - SES outbound for the email-fallback hand-offs and the monthly summary: verify a sender identity at
router@your-company.comwith DKIM and SPF on the parent domain. Out of sandbox by request.
IAM (least privilege per Lambda)
Each Lambda has its own role with policies scoped to exact ARNs. Sketch:
- read role:
sqs:ReceiveMessage+DeleteMessageontr-intake;bedrock:InvokeModelon the Haiku ARN;s3:GetObjecton the rules and examples keys;dynamodb:UpdateItemontr-tickets. - router role:
s3:GetObjectonrules.csvandvoice.txt;dynamodb:GetItemontr-tickets;dynamodb:PutItemontr-routes. Nobedrock:*. - dispatch role:
dynamodb:GetRecordson thetr-routesstream;dynamodb:Queryon thetr-ticketsGSI for the duplicate check;secretsmanager:GetSecretValueon the Slack bot token;ses:SendRawEmailfrom the verified sender; outbound network toslack.com. - correct-handler role:
dynamodb:UpdateItemontr-ticketsandtr-routes;dynamodb:PutItemontr-correctionsandtr-audit;s3:GetObject+PutObjectonexamples.jsonl;sqs:SendMessageontr-intake(for split). Verifies the Slack signing secret on every request. - intake-* roles:
s3:GetObjectontr-raw-mail(mail only);dynamodb:PutItem+Queryontr-tickets;sqs:SendMessageontr-intake; the secret for the lane’s shared secret or signature. - drive-sync role:
secretsmanager:GetSecretValueon the Google service-account secret;s3:PutObjectontr-rules-source; outbound network towww.googleapis.com.
Slack interactive flow
Hand-off cards are posted via the chat.postMessage Web API with Block Kit blocks containing the action buttons (Reassign, Bump, Split). Button clicks are sent by Slack to the configured Interactivity request URL, which is the correct-handler Function URL. correct-handler verifies the Slack signing secret on the inbound request, parses the action_id (reassign, bump, split), opens a menu or modal where needed (Reassign opens a team menu; Split opens a two-field modal; Bump is one-tap), and processes the response on submit.
The Slack app needs chat:write and the Interactivity URL configured. The bot token lives in Secrets Manager under tr/slack/bot-token; the signing secret under tr/slack/signing-secret.
Observability and cost gates
- CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on
"error"+"throttle"+"timeout"to a CloudWatch metric for alerting. - Alarms:
tr-intake-dlqdepth > 0 (a ticket failed to read three times);readBedrock throttle rate > 1% in 24h; dispatch failure rate > 1% in 24h;correct-handlersignature-verification failures > 5/hour (might mean the Slack secret rotated). - Custom metric: correction rate —
tr-correctionswrites overtr-routeswrites — tracked weekly. A rising correction rate on a topic means the examples for it need a refresh or the rules sheet needs a new row. - X-Ray: off by default. Not worth the cost at SMB volume.
- AWS Budgets: $15/month threshold for a typical SMB, alarm at 80% and 100%, posts to SNS topic
tr-cost-alarmsubscribed to the on-call admin’s email and Slack. Raise the ceiling to match higher steady volume.
Config and secrets
Service-account credentials for the Sheets and Docs APIs live in Secrets Manager under tr/drive/sa. Slack bot token and signing secret under tr/slack/*. The web-form and chat lane secrets under tr/form/secret and tr/chat/secret. SES sender identity lives in IAM and the verified-domain config. The topic list, the VIP list reference, the priority rules, and the admin fallback team all live in Parameter Store under /tr/config/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.
Deploy
GitHub Actions with OIDC into a deploy role (no long-lived keys) running AWS SAM. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for tr-rules-source so a bad rules edit or example flood can be rolled back in one click, and keep the DLQ alarm wired before going live so a read failure pages someone instead of vanishing. Total deployable surface: around ten Lambdas, four DDB tables, two S3 buckets, two SQS queues (one DLQ), one SES rule set, and one Budgets alarm.
That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.
All posts