Engineering reference: the refund handler architecture
Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, the S3 Vectors policy index, Lambda inventory, IAM scopes, the SES inbound rule set, the DynamoDB schemas, and the Slack interactive flow. Read alongside the previous six posts; this one’s the build sheet.
Region and account shape
Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock Global cross-Region inference, S3 Vectors, and Lambda Function URLs are all available there. A second region for resilience isn’t worth the extra work at SMB volume: the realistic failure mode is a refund email sitting unanswered, not a regional outage, and the SQS queue and DLQ already protect against that. One AWS account dedicated to the handler (separate from other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system. Because the handler can send replies and record decisions but never moves money, the account never needs payment credentials at all.
Topology
Lambda functions
All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.
- intake-reader: S3 PUT trigger on s3://rf-raw-mime/ for the email lane, and the target of the intake Function URL for the form and paste lanes. Parses the MIME (or the JSON body), and for free-text email calls Bedrock Haiku 4.5 to extract {customer, email, item, order_ref, amount, asked_for} as strict JSON with blanks for anything not present. Writes a clean record to the rf-intake SQS queue. Memory: 256 MB. Timeout: 30 s.
- intake-url: Lambda Function URL, AuthType: NONE, verifies a shared secret header and applies a small token-bucket rate limit before handing the body to the same reader path. Backs both the contact form (Lane 2) and the manual paste form (Lane 3, with a source=paste flag). Memory: 256 MB. Timeout: 15 s.
- checker: SQS event source on rf-intake (batch size 1 for clean per-request retry semantics). Embeds the request via Titan Text Embeddings V2 (amazon.titan-embed-text-v2:0, 1024-dim), queries the rf-policy S3 Vectors index for the top-k passages, and calls Bedrock Haiku 4.5 (global.anthropic.claude-haiku-4-5-20251001-v1:0) with a decide-only-from-these-passages prompt. If the model returns low confidence or conflicting passages, re-runs the single decision on Claude Sonnet 4.6 (global.anthropic.claude-sonnet-4-6-20250930-v1:0). Writes state to rf-requests and emits one of rf.in_policy, rf.out_of_policy, rf.high_value, rf.not_covered with the cited passage id. Memory: 512 MB. Timeout: 60 s.
- drafter: EventBridge rule on the four outcome events. Fetches the voice template for the outcome from s3://rf-policy-source/voice.txt, calls Haiku 4.5 to draft the reply, then runs a deterministic safety_check() (amount ≤ requested, outcome/answer agreement, no invented order/date/amount). Posts an approval card to Slack via chat.postMessage (Block Kit, with Approve/Edit/Decline) for the right channel/DM, or sends an email card via SES outbound. not_covered posts a card with no draft. Memory: 512 MB. Timeout: 30 s. No money movement.
- approve-handler: Lambda Function URL, AuthType: NONE; verifies the Slack signing secret. Handles Approve (send reply via SES SendRawEmail, mark rf-requests resolved, write rf-audit, post the decision row to the configured finance sink), Edit (open a modal pre-filled with the draft; on submit, send the edited reply and log the diff), and Decline (require a reason, send the decline reply, close as declined). Never calls any payment API; the actual payout stays a human/finance step. Memory: 256 MB. Timeout: 15 s.
- policy-sync: EventBridge Scheduler target, fires every 15 minutes. Uses the Google Docs/Drive API (service-account credentials in Secrets Manager under rf/drive/sa) to export the policy and voice docs to s3://rf-policy-source/ only if changed, splits the policy into passages, embeds each with Titan V2, and upserts them into the rf-policy S3 Vectors index (deleting passage ids that no longer exist). Memory: 512 MB. Timeout: 60 s.
- digest: EventBridge Scheduler target, weekly Monday 9am. Reads rf-requests and rf-audit for the past week; posts a Slack summary: requests in, approved, edited, declined, and any in-policy drafts a human overrode. No Bedrock; a plain summary table. Memory: 256 MB.
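The drafter's deterministic safety_check() could be sketched roughly as follows. This is a minimal illustration, not the handler's actual code: the Request and Draft shapes, the "approve"/"decline" answer labels, and the outcome-to-answer mapping are all assumptions made for the example, and only the amount rules and the outcome/answer agreement are covered (the invented order and date checks depend on formats the post does not specify).

```python
import re
from dataclasses import dataclass


@dataclass
class Request:
    amount: float   # amount the customer asked for
    outcome: str    # checker outcome, e.g. "in_policy"


@dataclass
class Draft:
    text: str              # reply text produced by the model
    answer: str            # hypothetical label: "approve" or "decline"
    offered_amount: float  # amount the draft promises


# Hypothetical mapping; outcomes not listed here skip the agreement check.
EXPECTED_ANSWER = {"in_policy": "approve", "out_of_policy": "decline"}

_AMOUNT_RE = re.compile(r"\$\s?(\d+(?:\.\d{1,2})?)")


def safety_check(req: Request, draft: Draft) -> list[str]:
    """Return a list of problems; an empty list means the draft may be posted."""
    problems = []
    if draft.offered_amount > req.amount:
        problems.append("offered amount exceeds requested amount")
    expected = EXPECTED_ANSWER.get(req.outcome)
    if expected is not None and draft.answer != expected:
        problems.append("draft answer disagrees with checker outcome")
    # Every dollar figure in the text must be one the system already knows.
    known = {round(req.amount, 2), round(draft.offered_amount, 2)}
    for match in _AMOUNT_RE.finditer(draft.text):
        if round(float(match.group(1)), 2) not in known:
            problems.append(f"unknown amount in draft: ${match.group(1)}")
    return problems
```

Because the check is plain code with no model call, the same draft always gets the same verdict, which is what makes it useful as a gate in front of the approval card.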
Storage
- DynamoDB · rf-requests: one row per request, its live state. PK request_id; attributes: source (email/form/paste), customer, item, order_ref, amount, outcome, cited_passage_id, status (queued/awaiting-approval/resolved/declined). On-demand.
- DynamoDB · rf-audit: one row per write action of any kind. PK (request_id, ts); attributes: action (approved/edited/declined), by_user, cited_passage_id, amount, before, after. On-demand. No TTL; this is the long-term audit trail.
- S3 Vectors · rf-policy: the embedded policy passages. One vector per passage with metadata {passage_id, heading, text}. Re-indexed by policy-sync on every change.
- S3 · rf-policy-source: mirrored policy and voice docs as plain text. Versioning enabled, so a bad policy edit can be rolled back in one click.
- S3 · rf-raw-mime: raw inbound MIME from the help inbox. Lifecycle to Glacier at 30 days; expiry at 7 years.
- SQS · rf-intake: the single intake queue. rf-intake-dlq behind it with maxReceiveCount: 5 so a poison record lands in the DLQ for inspection instead of looping.
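To make the rf-audit schema concrete, here is a small sketch of building one audit row. The function name build_audit_item and the ISO-8601 string format for ts are assumptions for illustration; the attribute names come from the schema above.

```python
from datetime import datetime, timezone
from decimal import Decimal

ALLOWED_ACTIONS = {"approved", "edited", "declined"}


def build_audit_item(request_id, action, by_user, cited_passage_id, amount,
                     before=None, after=None, now=None):
    """Build one rf-audit row keyed on (request_id, ts)."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown audit action: {action}")
    # Assumption: ts is stored as an ISO-8601 UTC string so rows sort by time.
    ts = (now or datetime.now(timezone.utc)).isoformat()
    item = {
        "request_id": request_id,
        "ts": ts,
        "action": action,
        "by_user": by_user,
        "cited_passage_id": cited_passage_id,
        # Decimal rather than float, which is what boto3's Table API expects.
        "amount": Decimal(str(amount)),
    }
    # before/after only apply to edits, so they are omitted when absent.
    if before is not None:
        item["before"] = before
    if after is not None:
        item["after"] = after
    return item
```

The resulting dict can be passed to boto3.resource("dynamodb").Table("rf-audit").put_item(Item=item).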
Bedrock
- Foundation models. global.anthropic.claude-haiku-4-5-20251001-v1:0 for the request read, the policy decision on the common path, and the reply draft; global.anthropic.claude-sonnet-4-6-20250930-v1:0 for the hard-case decision only. Both via the Global cross-Region inference profile.
- Embeddings. amazon.titan-embed-text-v2:0 at 1024 dimensions, for both the policy passages (at index time) and each incoming request (at query time). The vectors are stored in Amazon S3 Vectors.
- Quotas. Default account quotas are more than enough at SMB volume. The Sonnet path fires on a small fraction of requests, so its share of the token spend stays low.
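The Haiku-first, Sonnet-on-hard-cases routing can be sketched as below. This is an illustration under stated assumptions: the invoke callable stands in for whatever wrapper actually calls Bedrock, the "confidence" and "conflicting" fields in its result are a hypothetical response shape, and the 0.7 floor is a made-up threshold.

```python
from typing import Callable

HAIKU = "global.anthropic.claude-haiku-4-5-20251001-v1:0"
SONNET = "global.anthropic.claude-sonnet-4-6-20250930-v1:0"

# Hypothetical threshold; the post does not state the real cut-off.
CONFIDENCE_FLOOR = 0.7


def decide(prompt: str, invoke: Callable[[str, str], dict]) -> dict:
    """Run the decision on Haiku and re-run it once on Sonnet for hard cases.

    invoke(model_id, prompt) is an injected callable that returns a dict
    with (assumed) keys "outcome", "confidence", and "conflicting".
    """
    first = invoke(HAIKU, prompt)
    hard_case = (
        first.get("confidence", 0.0) < CONFIDENCE_FLOOR
        or first.get("conflicting", False)
    )
    if not hard_case:
        return {**first, "model": HAIKU}
    # Single re-run: the Sonnet answer is final, with no further escalation.
    second = invoke(SONNET, prompt)
    return {**second, "model": SONNET}
```

Recording which model produced the final decision makes it easy to confirm later that the Sonnet path really does fire on only a small fraction of requests.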
EventBridge config
- Outcome rule: one EventBridge rule on the default bus matching detail-type in (rf.in_policy, rf.out_of_policy, rf.high_value, rf.not_covered). Target: drafter Lambda.
- rf-policy-sync: EventBridge Scheduler, rate(15 minutes). Target: policy-sync Lambda.
- rf-weekly-digest: EventBridge Scheduler, cron(0 9 ? * MON *), evaluated in the schedule's configured time zone (TZ). Target: digest Lambda.
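As a sketch of how the outcome rule and the checker's events line up, the snippet below shows an event pattern for the rule and a PutEvents entry that would match it. The Source value "rf.checker" and the fields inside Detail are assumptions for illustration; the four detail-type values are the ones listed above.

```python
import json

OUTCOMES = ("rf.in_policy", "rf.out_of_policy", "rf.high_value", "rf.not_covered")

# Event pattern for the single outcome rule on the default bus.
OUTCOME_RULE_PATTERN = {"detail-type": list(OUTCOMES)}


def build_outcome_entry(outcome: str, request_id: str, cited_passage_id: str) -> dict:
    """Build one entry for the EventBridge PutEvents call."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    return {
        "Source": "rf.checker",  # hypothetical source name
        "DetailType": outcome,
        # Detail must be a JSON string, not a nested dict.
        "Detail": json.dumps(
            {"request_id": request_id, "cited_passage_id": cited_passage_id}
        ),
        "EventBusName": "default",
    }
```

The checker would then send the entry with boto3.client("events").put_events(Entries=[entry]).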
SES inbound and outbound
- Set the MX record on a dedicated subdomain (e.g. refunds.your-company.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
- SES inbound rule set rf-inbound-rules: one rule with recipient refunds@your-company.com → spam scan → S3 PUT to s3://rf-raw-mime/<message-id> → stop. The S3 PUT triggers intake-reader.
- SES outbound for the approved replies and email-card fallback: verify a sender identity at support@your-company.com with DKIM and SPF on the parent domain. Out of sandbox by request.
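Once SES has written the raw message to rf-raw-mime, intake-reader has to turn MIME bytes into fields. A minimal sketch using only the standard library follows; the function name parse_inbound and the returned field names are assumptions for the example, and the Bedrock extraction step that follows it is not shown.

```python
from email import message_from_bytes, policy
from email.utils import parseaddr


def parse_inbound(raw: bytes) -> dict:
    """Pull sender, subject, and plain-text body out of raw MIME bytes."""
    msg = message_from_bytes(raw, policy=policy.default)
    name, addr = parseaddr(str(msg.get("From", "")))
    # Prefer the text/plain part; HTML-only mail yields an empty body here.
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content().strip() if part is not None else ""
    return {
        "customer": name,
        "email": addr,
        "subject": str(msg.get("Subject", "")),
        "body": body,
    }
```

In the Lambda, raw would come from an s3:GetObject on the key named in the S3 PUT event.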
IAM (least privilege per Lambda)
Each Lambda has its own role with policies scoped to exact ARNs. Sketch:
- checker role: sqs:ReceiveMessage + DeleteMessage on rf-intake; bedrock:InvokeModel on the Titan, Haiku, and Sonnet ARNs; s3vectors:QueryVectors on the rf-policy index; dynamodb:PutItem + GetItem on rf-requests; events:PutEvents on the default bus.
- drafter role: s3:GetObject on the voice key; bedrock:InvokeModel on the Haiku ARN; secretsmanager:GetSecretValue on the Slack bot token; ses:SendRawEmail for the email-card fallback; outbound network access to slack.com. No dynamodb:* writes beyond status.
- approve-handler role: ses:SendRawEmail from the verified sender; dynamodb:PutItem on rf-requests and rf-audit; secretsmanager:GetSecretValue on the Slack signing secret. No payment-service permissions of any kind; the role literally cannot move money.
- intake-reader / intake-url roles: s3:GetObject on rf-raw-mime; bedrock:InvokeModel on the Haiku ARN; sqs:SendMessage on rf-intake; secretsmanager:GetSecretValue on the form shared-secret.
- policy-sync role: secretsmanager:GetSecretValue on the Google service-account secret; s3:PutObject on rf-policy-source; bedrock:InvokeModel on the Titan ARN; s3vectors:PutVectors + DeleteVectors on the rf-policy index; outbound network to www.googleapis.com.
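As an illustration of "scoped to exact ARNs", the checker role's policy document could be generated like this. The account id is a placeholder, and the Bedrock and S3 Vectors ARNs are passed in by the caller because their exact form depends on how the inference profiles and vector bucket are set up in the account; treat the whole thing as a sketch rather than a ready-to-deploy policy.

```python
import json

REGION = "ap-southeast-1"
ACCOUNT = "123456789012"  # placeholder account id


def arn(service: str, resource: str) -> str:
    return f"arn:aws:{service}:{REGION}:{ACCOUNT}:{resource}"


def checker_policy(bedrock_arns: list[str], vector_index_arn: str) -> dict:
    """Build the checker role's policy with one statement per service."""
    statements = [
        # Lambda's SQS event source mapping also needs GetQueueAttributes.
        (["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"],
         [arn("sqs", "rf-intake")]),
        (["bedrock:InvokeModel"], bedrock_arns),
        (["s3vectors:QueryVectors"], [vector_index_arn]),
        (["dynamodb:PutItem", "dynamodb:GetItem"],
         [arn("dynamodb", "table/rf-requests")]),
        (["events:PutEvents"], [arn("events", "event-bus/default")]),
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": actions, "Resource": resources}
            for actions, resources in statements
        ],
    }
```

Printing json.dumps(checker_policy(...), indent=2) gives a document that can be pasted into a SAM template or reviewed in a pull request; a quick check that no Resource contains "*" is a cheap guard against accidental wildcards.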
Slack interactive flow
Approval cards are posted via the chat.postMessage Web API with Block Kit blocks containing the Approve, Edit, and Decline buttons. Button clicks are sent by Slack to the configured Interactivity request URL, which is the approve-handler Function URL. approve-handler verifies the Slack signing secret on the inbound request, parses the action_id (approve, edit, decline), opens a modal where needed (Edit and Decline open modals; Approve is one-tap), and processes the response on submit. High-value and out-of-policy cards are routed to the named approver’s DM rather than the shared channel.
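A rough sketch of the approval card payload is shown here. The function name build_approval_card, the summary and draft text layout, and the use of the request id as each button's value are assumptions; the action_id values (approve, edit, decline) are the ones the handler parses.

```python
from typing import Optional


def build_approval_card(channel: str, request_id: str, summary: str,
                        draft: Optional[str] = None) -> dict:
    """Build keyword arguments for Slack's chat.postMessage call."""

    def button(label: str, action_id: str, style: Optional[str] = None) -> dict:
        element = {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "action_id": action_id,
            "value": request_id,  # assumption: lets the handler find the request
        }
        if style:
            element["style"] = style
        return element

    blocks = [{"type": "section", "text": {"type": "mrkdwn", "text": summary}}]
    if draft:
        # not_covered cards carry no draft, so this section is skipped for them.
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Draft reply*\n{draft}"},
        })
    blocks.append({
        "type": "actions",
        "elements": [
            button("Approve", "approve", "primary"),
            button("Edit", "edit"),
            button("Decline", "decline", "danger"),
        ],
    })
    # "text" is the plain fallback shown in notifications.
    return {"channel": channel, "text": summary, "blocks": blocks}
```

Routing to the named approver's DM rather than the shared channel is then just a matter of which channel id is passed in.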
The Slack app needs chat:write and im:write, plus the Interactivity URL configured. The bot token lives in Secrets Manager under rf/slack/bot-token; the signing secret is rf/slack/signing-secret.
Observability and cost gates
- CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Metric filter on "error", "throttle", and "timeout" feeding a CloudWatch metric for alerting.
- Alarms: rf-intake-dlq depth > 0 (a request failed to process); checker failure rate > 1% in 24h; approve-handler signature-verification failures > 5/hour (might mean the Slack secret rotated); Bedrock token spend anomaly via a daily Cost Anomaly Detection monitor.
- X-Ray: off by default. Not worth the cost at SMB volume.
- AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic rf-cost-alarm, subscribed to the on-call admin’s email and Slack.
Config and secrets
The Google service-account credentials for the Docs/Drive API live in Secrets Manager under rf/drive/sa. The Slack bot token and signing secret are under rf/slack/*. The contact-form shared secret is under rf/form/secret. The SES sender identity is not a secret: it lives in the SES verified-domain config, with permission to send from it granted through IAM. The dollar cap, the named approver per category, the top-k for policy retrieval, and the finance-sink target all live in Parameter Store under /rf/config/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.
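The fetch-on-cold-start pattern can be sketched as a module-level cache, since module globals survive between invocations in the same execution environment. The helper names here are illustrative, and the fetcher is injectable so the caching logic can be exercised without AWS.

```python
from typing import Callable, Optional

# Module-level state persists for the life of the execution environment.
_cache: Optional[dict] = None


def fetch_from_ssm(prefix: str) -> dict:
    """Read every parameter under the prefix from Parameter Store."""
    import boto3  # imported lazily so the cache can be tested without AWS

    ssm = boto3.client("ssm")
    params = {}
    pages = ssm.get_paginator("get_parameters_by_path").paginate(
        Path=prefix, Recursive=True, WithDecryption=True
    )
    for page in pages:
        for param in page["Parameters"]:
            # Strip the prefix so "/rf/config/top_k" becomes "top_k".
            params[param["Name"][len(prefix):]] = param["Value"]
    return params


def get_config(fetch: Callable[[str], dict] = fetch_from_ssm,
               prefix: str = "/rf/config/") -> dict:
    """Return cached config, fetching it only on the first call."""
    global _cache
    if _cache is None:
        _cache = fetch(prefix)
    return _cache
```

The trade-off is that a changed parameter is only picked up when a new execution environment starts, which is usually acceptable for values like the dollar cap that change rarely.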
Deploy
GitHub Actions with OIDC into a deploy role (no long-lived keys) and AWS SAM. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for rf-policy-source so a bad policy edit rolls back in one click, and keep the approve-handler role free of any payment permissions so the “never moves money” guarantee is enforced by IAM, not just by code. Total deployable surface: around seven Lambdas, two DDB tables, one S3 Vectors index, three S3 buckets, one SQS queue plus its DLQ, one EventBridge rule on the default bus (plus the Scheduler rules), one SES rule set, and one Budgets alarm.
That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.