Engineering reference: the refund handler architecture

Region and account shape

Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock Global cross-Region inference, S3 Vectors, and Lambda Function URLs are all available there. A second region for resilience isn’t worth the extra work at SMB volume — the failure mode for an SMB is a refund email sitting unanswered, which the SQS queue and DLQ already protect against, not a regional outage. One AWS account dedicated to the handler (separate from other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system. Because the handler can send replies and record decisions but never moves money, the account never needs payment credentials at all.

Topology

Fig 7. AWS topology, drawn as three zones: ingress (three lanes into one queue), per-request processing (the checker grounds against the S3 Vectors policy index and emits an outcome event), and reply and approval (the drafter writes and the approver’s decision is recorded). Every Lambda is event- or queue-driven; no Lambda invokes another synchronously.

Lambda functions

All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.

intake-reader — S3 PUT trigger on s3://rf-raw-mime/ for the email lane, and the target of the intake Function URL for the form and paste lanes. Parses the MIME (or the JSON body), and for free-text email calls Bedrock Haiku 4.5 to extract {customer, email, item, order_ref, amount, asked_for} as strict JSON with blanks for anything not present. Writes a clean record to the rf-intake SQS queue. Memory: 256 MB. Timeout: 30 s.
intake-url — Lambda Function URL, AuthType: NONE, verifies a shared secret header and applies a small token-bucket rate limit before handing the body to the same reader path. Backs both the contact form (Lane 2) and the manual paste form (Lane 3, with a source=paste flag). Memory: 256 MB. Timeout: 15 s.
checker — SQS event source on rf-intake (batch size 1 for clean per-request retry semantics). Embeds the request via Titan Text Embeddings V2 (amazon.titan-embed-text-v2:0, 1024-dim), queries the rf-policy S3 Vectors index for the top-k passages, and calls Bedrock Haiku 4.5 (global.anthropic.claude-haiku-4-5-20251001-v1:0) with a decide-only-from-these-passages prompt. If the model returns low confidence or conflicting passages, re-runs the single decision on Claude Sonnet 4.6 (global.anthropic.claude-sonnet-4-6-20250930-v1:0). Writes state to rf-requests and emits one of rf.in_policy, rf.out_of_policy, rf.high_value, rf.not_covered with the cited passage id. Memory: 512 MB. Timeout: 60 s.
drafter — EventBridge rule on the four outcome events. Fetches the voice template for the outcome from s3://rf-policy-source/voice.txt, calls Haiku 4.5 to draft the reply, then runs a deterministic safety_check() (amount ≤ requested, outcome/answer agreement, no invented order/date/amount). Posts an approval card to Slack via chat.postMessage (Block Kit, with Approve/Edit/Decline) for the right channel/DM, or sends an email card via SES outbound. not_covered posts a card with no draft. Memory: 512 MB. Timeout: 30 s. No money movement.
approve-handler — Lambda Function URL, AuthType: NONE; verifies the Slack signing secret. Handles Approve (send reply via SES SendRawEmail, mark rf-requests resolved, write rf-audit, post the decision row to the configured finance sink), Edit (open a modal pre-filled with the draft; on submit, send the edited reply and log the diff), and Decline (require a reason, send the decline reply, close as declined). Never calls any payment API — the actual payout stays a human/finance step. Memory: 256 MB. Timeout: 15 s.
policy-sync — EventBridge Scheduler target, fires every 15 minutes. Uses the Google Docs/Drive API (service-account credentials in Secrets Manager under rf/drive/sa) to export the policy and voice docs to s3://rf-policy-source/ only if changed, splits the policy into passages, embeds each with Titan V2, and upserts them into the rf-policy S3 Vectors index (deleting passage ids that no longer exist). Memory: 512 MB. Timeout: 60 s.
digest — EventBridge Scheduler target, weekly Monday 9am. Reads rf-requests and rf-audit for the past week; posts a Slack summary: requests in, approved, edited, declined, and any in-policy drafts a human overrode. No Bedrock; a plain summary table. Memory: 256 MB.
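
The drafter's deterministic safety_check() is described above only by its rules. A minimal sketch of how such a check could look, assuming the draft arrives with a few structured fields next to its text. The Request and Draft shapes, the ALLOWED_ANSWERS mapping, and the ORD-1234 order-reference format are illustrative assumptions, not the handler's actual code, and the date check is left out for brevity:

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Request:
    order_ref: str
    amount: Decimal          # amount the customer asked for
    outcome: str             # outcome event name, e.g. "rf.in_policy"


@dataclass(frozen=True)
class Draft:
    text: str                # reply text produced by the model
    answer: str              # "approve" or "decline", as stated by the draft
    offered_amount: Decimal  # refund amount the draft offers


# Hypothetical mapping from outcome event to the answers a draft may give.
ALLOWED_ANSWERS = {
    "rf.in_policy": {"approve"},
    "rf.out_of_policy": {"decline"},
    "rf.high_value": {"approve", "decline"},
}

AMOUNT_RE = re.compile(r"\$\s?(\d+(?:\.\d{1,2})?)")
ORDER_RE = re.compile(r"\b(ORD-\d+)\b")  # assumed order-reference format


def safety_check(req: Request, draft: Draft) -> list[str]:
    """Return a list of violations; an empty list means the draft passes."""
    problems = []
    if draft.offered_amount > req.amount:
        problems.append("offered amount exceeds requested amount")
    if draft.answer not in ALLOWED_ANSWERS.get(req.outcome, set()):
        problems.append("answer disagrees with outcome")
    # Any amount quoted in the text must be the requested or offered amount.
    known = {req.amount, draft.offered_amount}
    for quoted in AMOUNT_RE.findall(draft.text):
        if Decimal(quoted) not in known:
            problems.append(f"invented amount: {quoted}")
    # Any order reference quoted in the text must be the one on the request.
    for ref in ORDER_RE.findall(draft.text):
        if ref != req.order_ref:
            problems.append(f"invented order reference: {ref}")
    return problems
```

Returning a list of violations instead of a boolean lets the approval card show why a draft was held back.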

Storage

DynamoDB · rf-requests — one row per request, its live state. PK request_id; attributes: source (email/form/paste), customer, item, order_ref, amount, outcome, cited_passage_id, status (queued/awaiting-approval/resolved/declined). On-demand.
DynamoDB · rf-audit — one row per write action of any kind. Partition key request_id, sort key ts; attributes: action (approved/edited/declined), by_user, cited_passage_id, amount, before, after. On-demand. No TTL, since this is the long-term audit trail.
S3 Vectors · rf-policy — the embedded policy passages. One vector per passage with metadata {passage_id, heading, text}. Re-indexed by policy-sync on every change.
S3 · rf-policy-source — mirrored policy and voice docs as plain text. Versioning enabled, so a bad policy edit can be rolled back in one click.
S3 · rf-raw-mime — raw inbound MIME from the help inbox. Lifecycle to Glacier at 30 days; expiry at 7 years.
SQS · rf-intake — the single intake queue. rf-intake-dlq behind it with maxReceiveCount: 5 so a poison record lands in the DLQ for inspection instead of looping.
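
The DLQ wiring on rf-intake comes down to one queue attribute. A small sketch of how that attribute could be built, assuming the queue is created from Python; the account id in the ARN is a placeholder and redrive_policy is a hypothetical helper name:

```python
import json


def redrive_policy(dlq_arn: str, max_receive_count: int = 5) -> dict[str, str]:
    """Build the SQS queue attribute that routes poison messages to a DLQ."""
    return {
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
        )
    }


# Placeholder account id; substitute the real one.
DLQ_ARN = "arn:aws:sqs:ap-southeast-1:123456789012:rf-intake-dlq"
attributes = redrive_policy(DLQ_ARN)
# These attributes could be passed to boto3 as
# sqs.create_queue(QueueName="rf-intake", Attributes=attributes).
```

AWS's Lambda guidance suggests a queue visibility timeout of at least six times the function timeout, which for the 60 s checker would be 360 s; that attribute is not shown here.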

Bedrock

Foundation models. global.anthropic.claude-haiku-4-5-20251001-v1:0 for the request read, the policy decision on the common path, and the reply draft; global.anthropic.claude-sonnet-4-6-20250930-v1:0 for the hard-case decision only. Both via the Global cross-Region inference profile.
Embeddings. amazon.titan-embed-text-v2:0 at 1024 dimensions, for both the policy passages (at index time) and each incoming request (at query time). The vectors are stored in, and queried from, Amazon S3 Vectors.
Quotas. Default account quotas are more than enough at SMB volume. The Sonnet path fires on a small fraction of requests, so its share of the token spend stays low.
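
The two-tier model choice can be sketched as a small routing function. This is an illustration under stated assumptions: the decision dict with confidence and conflicting_passages fields, the 0.7 threshold, and the injected invoke callable are all hypothetical. Only the model ids and the 1024-dimension setting come from the text above:

```python
import json

HAIKU = "global.anthropic.claude-haiku-4-5-20251001-v1:0"
SONNET = "global.anthropic.claude-sonnet-4-6-20250930-v1:0"
TITAN = "amazon.titan-embed-text-v2:0"

CONFIDENCE_FLOOR = 0.7  # assumed threshold; the source does not state one


def needs_escalation(decision: dict) -> bool:
    """True when the first-pass decision should be re-run on the larger model."""
    low_confidence = decision.get("confidence", 0.0) < CONFIDENCE_FLOOR
    return low_confidence or bool(decision.get("conflicting_passages"))


def decide(request_text: str, passages: list[str], invoke) -> dict:
    """Run the decision on Haiku, then once more on Sonnet if it looks shaky.

    `invoke(model_id, request_text, passages)` is a caller-supplied function
    that wraps the Bedrock call and returns the parsed decision dict.
    """
    decision = invoke(HAIKU, request_text, passages)
    if needs_escalation(decision):
        decision = invoke(SONNET, request_text, passages)
    return decision


def titan_embed_body(text: str) -> str:
    """Request body for Titan Text Embeddings V2 at 1024 dimensions."""
    return json.dumps({"inputText": text, "dimensions": 1024, "normalize": True})
```

Passing invoke in as a parameter keeps the routing logic testable without any AWS access.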

EventBridge config

Outcome rule — one EventBridge rule on the default bus matching detail-type in (rf.in_policy, rf.out_of_policy, rf.high_value, rf.not_covered). Target: drafter Lambda.
rf-policy-sync — EventBridge Scheduler, rate(15 minutes). Target: policy-sync Lambda.
rf-weekly-digest — EventBridge Scheduler, cron(0 9 ? * MON *), evaluated in the time zone configured on the schedule (Monday 09:00 local). Target: digest Lambda.
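
A sketch of the outcome rule's event pattern and of the entry the checker could hand to PutEvents. The Source value rf.checker and the fields inside Detail are assumptions for illustration. The matches helper is only a local stand-in for unit tests, not how EventBridge evaluates patterns:

```python
import json

OUTCOMES = ["rf.in_policy", "rf.out_of_policy", "rf.high_value", "rf.not_covered"]

# Event pattern for the outcome rule: match on detail-type only.
OUTCOME_PATTERN = {"detail-type": OUTCOMES}


def outcome_entry(outcome: str, request_id: str, passage_id: str) -> dict:
    """Build one PutEvents entry for the default bus."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    return {
        "Source": "rf.checker",  # assumed source name
        "DetailType": outcome,
        "Detail": json.dumps(
            {"request_id": request_id, "cited_passage_id": passage_id}
        ),
        "EventBusName": "default",
    }


def matches(pattern: dict, entry: dict) -> bool:
    """Local stand-in for the rule's detail-type match, for unit tests only."""
    return entry["DetailType"] in pattern["detail-type"]
```

With boto3, a list of such entries would go to events.put_events(Entries=[...]).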

SES inbound and outbound

Set the MX record on a dedicated subdomain (e.g. refunds.your-company.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
SES inbound rule set rf-inbound-rules: one rule with recipient refunds@your-company.com → spam scan → S3 PUT to s3://rf-raw-mime/<message-id> → stop. The S3 PUT triggers intake-reader.
SES outbound for the approved replies and email-card fallback: verify a sender identity at support@your-company.com with DKIM and SPF on the parent domain. Out of sandbox by request.
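
Once SES has written the raw message to S3, intake-reader has to turn MIME bytes into fields. A minimal sketch using only the standard library; parse_inbound and the returned field names are assumptions, and fetching the object from S3 is left out:

```python
from email import message_from_bytes, policy
from email.utils import parseaddr


def parse_inbound(raw: bytes) -> dict:
    """Extract sender, subject, and text body from a raw MIME message."""
    msg = message_from_bytes(raw, policy=policy.default)
    name, address = parseaddr(str(msg.get("From", "")))
    # Prefer the plain-text part; fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "customer": name,
        "email": address,
        "subject": str(msg.get("Subject", "")),
        "body": body.strip(),
    }
```

The free-text body from this step is what would be passed to the model for field extraction.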

IAM (least privilege per Lambda)

Each Lambda has its own role with policies scoped to exact ARNs. Sketch:

checker role: sqs:ReceiveMessage + DeleteMessage + GetQueueAttributes on rf-intake (the three actions an SQS event source mapping needs); bedrock:InvokeModel on the Titan, Haiku, and Sonnet ARNs; s3vectors:QueryVectors on the rf-policy index; dynamodb:PutItem + GetItem on rf-requests; events:PutEvents on the default bus.
drafter role: s3:GetObject on the voice key; bedrock:InvokeModel on the Haiku ARN; secretsmanager:GetSecretValue on the Slack bot token; ses:SendRawEmail for the email-card fallback; outbound network access to slack.com. No dynamodb:* writes beyond status.
approve-handler role: ses:SendRawEmail from the verified sender; dynamodb:PutItem on rf-requests and rf-audit; secretsmanager:GetSecretValue on the Slack signing secret. No payment-service permissions of any kind — the role literally cannot move money.
intake-reader / intake-url roles: s3:GetObject on rf-raw-mime; bedrock:InvokeModel on the Haiku ARN; sqs:SendMessage on rf-intake; secretsmanager:GetSecretValue on the form shared-secret.
policy-sync role: secretsmanager:GetSecretValue on the Google service-account secret; s3:PutObject on rf-policy-source; bedrock:InvokeModel on the Titan ARN; s3vectors:PutVectors + DeleteVectors on the rf-policy index; outbound network to www.googleapis.com.
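
To make "scoped to exact ARNs" concrete, here is a sketch of the approve-handler policy as a Python dict, plus a helper that lists the granted actions so a test can assert nothing else slipped in. The account id is a placeholder, and the function names are hypothetical:

```python
REGION = "ap-southeast-1"
ACCOUNT = "123456789012"  # placeholder account id


def approve_handler_policy() -> dict:
    """Identity policy for approve-handler: send mail, write state, read one secret."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ses:SendRawEmail",
                "Resource": f"arn:aws:ses:{REGION}:{ACCOUNT}:identity/support@your-company.com",
            },
            {
                "Effect": "Allow",
                "Action": "dynamodb:PutItem",
                "Resource": [
                    f"arn:aws:dynamodb:{REGION}:{ACCOUNT}:table/rf-requests",
                    f"arn:aws:dynamodb:{REGION}:{ACCOUNT}:table/rf-audit",
                ],
            },
            {
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                # Secrets Manager appends a random suffix to the name in the ARN.
                "Resource": f"arn:aws:secretsmanager:{REGION}:{ACCOUNT}:secret:rf/slack/signing-secret-*",
            },
        ],
    }


def allowed_actions(policy: dict) -> set[str]:
    """Collect every action the policy allows, whether given as str or list."""
    actions: set[str] = set()
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        action = statement["Action"]
        actions.update([action] if isinstance(action, str) else action)
    return actions
```

A check like allowed_actions(...) in CI is one way to keep the "cannot move money" property from drifting as the policy is edited.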

Slack interactive flow

Approval cards are posted via the chat.postMessage Web API with Block Kit blocks containing the Approve, Edit, and Decline buttons. Slack sends button clicks to the configured Interactivity request URL, which is the approve-handler Function URL. approve-handler verifies the signature on the inbound request using the Slack signing secret, parses the action_id (approve, edit, decline), opens a modal where needed (Edit and Decline open modals; Approve is one-tap), and processes the response on submit. High-value and out-of-policy cards are routed to the named approver’s DM rather than the shared channel.

The Slack app needs chat:write and im:write, plus the Interactivity URL configured. The bot token lives in Secrets Manager under rf/slack/bot-token; the signing secret is rf/slack/signing-secret.
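
Because the Function URL uses AuthType: NONE, the signature check is the only gate on approve-handler. A sketch of Slack's documented v0 signing scheme using the standard library; the function name and the injectable now parameter are assumptions made for testability. The timestamp and signature come from the X-Slack-Request-Timestamp and X-Slack-Signature headers:

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 60 * 5  # reject requests more than five minutes old


def verify_slack_signature(signing_secret, timestamp, body, signature, now=None):
    """Return True only if the request was signed with our signing secret.

    `body` must be the raw request bytes, exactly as received.
    """
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except (TypeError, ValueError):
        return False
    # Stale timestamps are refused to limit replay of captured requests.
    if abs(now - sent_at) > MAX_SKEW_SECONDS:
        return False
    base = b"v0:" + str(timestamp).encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), str(signature).encode())
```

The check has to run on the raw body before any form or JSON parsing, since re-serialising the payload changes the bytes being signed.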

Observability and cost gates

CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Metric filter on "error" + "throttle" + "timeout" that publishes a CloudWatch metric for alerting.
Alarms: rf-intake-dlq depth > 0 (a request failed to process); checker failure rate > 1% in 24h; approve-handler signature-verification failures > 5/hour (might mean the Slack secret rotated); Bedrock token spend anomaly via a daily Cost Anomaly Detection monitor.
X-Ray: off by default. Not worth the cost at SMB volume.
AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic rf-cost-alarm subscribed to the on-call admin’s email and Slack.
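
For the structured JSON logs, one option is a small formatter on the standard logging module. This is a sketch, not the handler's actual logging setup; the field names and the record.fields convention for extra context are assumptions:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Optional per-call context, passed as extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def configure(logger_name: str) -> logging.Logger:
    """Attach the JSON formatter to a named logger."""
    logger = logging.getLogger(logger_name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    return logger
```

With JSON lines like these, a CloudWatch filter pattern such as { $.level = "error" } can match on the field instead of on a bare substring.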

Config and secrets

The Google service-account credentials for the Docs/Drive API live in Secrets Manager under rf/drive/sa. The Slack bot token and signing secret are under rf/slack/*. The contact-form shared secret is under rf/form/secret. The SES sender identity needs no secret; it is governed by the verified-domain config in SES and the ses:SendRawEmail grants in IAM. The dollar cap, the named approver per category, the top-k for policy retrieval, and the finance-sink target all live in Parameter Store under /rf/config/. Lambdas fetch config on cold start and cache it for the lifetime of the execution environment.

Deploy

GitHub Actions with OIDC into a deploy role (no long-lived keys) and AWS SAM. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for rf-policy-source so a bad policy edit rolls back in one click, and keep the approve-handler role free of any payment permissions so the “never moves money” guarantee is enforced by IAM, not just by code. Total deployable surface: seven Lambdas, two DDB tables, one S3 Vectors index, three S3 buckets, one SQS queue plus its DLQ, one EventBridge rule on the default bus (plus the two Scheduler schedules), one SES rule set, and one Budgets alarm.

That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.
