Engineering reference: the churn predictor architecture
Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, EventBridge Scheduler config, the DynamoDB schemas, and the Slack interactive flow. Read alongside the previous six posts; this one’s the build sheet.
Region and account shape
Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock cross-Region inference, and EventBridge Scheduler are all available there. A second region for multi-region resilience isn’t worth the extra setup work at SMB volume: the failure mode for an SMB is one missed weekly list, not a regional outage. One AWS account dedicated to the predictor (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system.
Topology
Lambda functions
All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.
- drive-sync: EventBridge Scheduler target, fires every 15 minutes. Uses the Google Drive API + Sheets API (service-account credentials in Secrets Manager under cp/drive/sa) to export the customer sheet as CSV and write to s3://cp-list-source/customers.csv only if the sheet has changed since the last sync. Same pattern syncs the rules and voice docs to s3://cp-rules-source/. Memory: 256 MB. Timeout: 30 s.
- order-import: S3 PUT trigger on s3://cp-order-feed/ (the store or billing tool drops a daily CSV; a small connector or a scheduled export populates it). Groups rows by customer, derives last_order_date and order_pace (median inter-order gap over a trailing window), and writes them back to the Drive sheet via the Sheets API batchUpdate. Idempotent on re-run of the same file. No model; these are facts. Memory: 256 MB. Timeout: 60 s.
- mood-reader: S3 PUT trigger on s3://cp-raw-mime/. Parses the MIME, extracts the ticket body and the customer’s email/identifier, and calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0 via global.anthropic.claude-haiku-4-5-20251001-v1:0) with a constrained prompt that returns one of sour/flat/happy. Maps the label to a number, blends it with the customer’s recent mood (exponential moving average over the trailing few tickets so one bad day doesn’t dominate), and writes support_mood back to the sheet. Strictly read-only with respect to the customer; it never drafts or sends a reply. Memory: 256 MB. Timeout: 30 s.
- scorer: EventBridge Scheduler target, weekly Monday at 8am local time (the schedule expression runs in TZ_NAME set to the SMB’s timezone, e.g. Asia/Singapore). Reads s3://cp-list-source/customers.csv and the rules and voice docs. For each row, turns each signal into points using the weights, sums to a total out of 100, reads prior state from cp-state, and assigns a band. Emits one cp.weekly_list event per owner carrying that owner’s at-risk and churning candidates with their scores and per-signal point breakdowns as the event payload. Steady and watch customers emit no list event. Memory: 512 MB. Timeout: 60 s. No Bedrock calls.
- handoff: EventBridge rule on the cp.weekly_list event. Resolves owner, applies the cap (rank by score, churning first, keep top N from the rules doc), drops candidates inside the contact pause window read from cp-state, and for each surviving name calls Bedrock Haiku 4.5 to render the point breakdown into a one-line plain reason (grounded strictly in the supplied points). Ships via Slack chat.postMessage with Block Kit buttons (cp/slack/bot-token in Secrets Manager) or SES SendRawEmail for email fallback. Writes the surfaced names to cp-state after a successful send. Memory: 512 MB. Timeout: 60 s.
- outcome-handler: Lambda Function URL, public with AuthType: NONE; verifies a Slack signature on the request body. Triggered by Slack interactive button clicks (Reached-out/Won-back/Lost) and by email-link clicks. Writes to cp-state and cp-audit; on won-back, resets the customer’s score and clears the surfaced/contact fields; on lost, records the reason and marks the customer so the scorer stops surfacing them. Memory: 256 MB. Timeout: 15 s.
- digest: optional EventBridge Scheduler target, weekly Friday 4pm. Reads cp-state for the watch band and the week’s outcomes; posts a short “watch list and outcomes so far” message to a configured Slack channel. No Bedrock; a plain summary table. Memory: 256 MB.
- summary: EventBridge Scheduler target, monthly on the first Monday at 9am. Reads the past month’s cp-state and cp-audit; calls Bedrock Haiku 4.5 to write a one-paragraph owner narrative (flagged, reached, won back with recovered value, lost with top reasons); emails it via SES to the configured stakeholder list. Memory: 512 MB.
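The two derived signals described above (order_pace in order-import and the blended support_mood in mood-reader) can be sketched roughly as follows. This is a minimal illustration, not the production code: the label-to-number mapping, the smoothing factor alpha, and the window size are assumptions that the post does not specify.

```python
from __future__ import annotations

from datetime import date
from statistics import median

# Hypothetical label-to-number mapping; the real values are not given in the post.
MOOD_VALUE = {"sour": 0.0, "flat": 0.5, "happy": 1.0}


def order_pace(order_dates: list[date], window: int = 6) -> float | None:
    """Median gap in days between consecutive orders over the last `window` order dates."""
    recent = sorted(set(order_dates))[-window:]
    if len(recent) < 2:
        return None  # not enough history to derive a pace
    gaps = [(later - earlier).days for earlier, later in zip(recent, recent[1:])]
    return float(median(gaps))


def blend_mood(previous: float | None, label: str, alpha: float = 0.4) -> float:
    """Exponential moving average: alpha * latest + (1 - alpha) * previous."""
    latest = MOOD_VALUE[label]
    if previous is None:
        return latest  # the first ticket seeds the average
    return alpha * latest + (1 - alpha) * previous
```

A smaller alpha makes a single sour ticket move the blended mood less, which is the "one bad day doesn’t dominate" behaviour the mood-reader aims for.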
Storage
- DynamoDB · cp-state: one row per customer. PK customer_id; attributes: score, band, reason, surfaced_date, last_contact, status (active/lost), owner. On-demand. No TTL; it’s the live state the scorer reads each week.
- DynamoDB · cp-audit: one row per write action of any kind. PK (customer_id, ts); attributes: action (reached_out/won_back/lost/undo), by_user, before, after, notes (e.g. lost-reason, recovered value). On-demand. No TTL; this is the long-term outcome trail the summary counts from.
- S3 · cp-list-source: mirrored CSV from the Drive customer list. Versioning enabled. Lifecycle to Glacier at 90 days; expiry at 7 years.
- S3 · cp-rules-source: mirrored rules and voice docs as plain text. Versioning enabled.
- S3 · cp-order-feed: daily order exports from the store/billing tool. Lifecycle to Glacier at 30 days; expiry at 1 year (the derived columns live in the sheet, so the raw exports are short-lived).
- S3 · cp-raw-mime: raw inbound MIME from forwarded support tickets. Lifecycle to Glacier at 30 days; expiry at 7 years.
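As a rough sketch, the two table definitions could be expressed as keyword arguments in the shape that boto3’s DynamoDB create_table call accepts, alongside a small builder for audit rows. Treating customer_id as the partition key and ts as the sort key of cp-audit is my reading of the (customer_id, ts) key above. The string attribute types and the ISO-8601 timestamp format are assumptions.

```python
from __future__ import annotations

from datetime import datetime, timezone

# On-demand tables; could be passed as boto3.client("dynamodb").create_table(**CP_STATE).
CP_STATE = {
    "TableName": "cp-state",
    "AttributeDefinitions": [{"AttributeName": "customer_id", "AttributeType": "S"}],
    "KeySchema": [{"AttributeName": "customer_id", "KeyType": "HASH"}],
    "BillingMode": "PAY_PER_REQUEST",
}
CP_AUDIT = {
    "TableName": "cp-audit",
    "AttributeDefinitions": [
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "ts", "AttributeType": "S"},  # assumed: ISO-8601 string
    ],
    "KeySchema": [
        {"AttributeName": "customer_id", "KeyType": "HASH"},  # partition key
        {"AttributeName": "ts", "KeyType": "RANGE"},  # sort key
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

ALLOWED_ACTIONS = {"reached_out", "won_back", "lost", "undo"}


def audit_item(customer_id: str, action: str, by_user: str, before: dict,
               after: dict, notes: str = "", now: datetime | None = None) -> dict:
    """Build one cp-audit row; `before` and `after` are snapshots of the cp-state row."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    ts = (now or datetime.now(timezone.utc)).isoformat()
    return {"customer_id": customer_id, "ts": ts, "action": action,
            "by_user": by_user, "before": before, "after": after, "notes": notes}
```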
Bedrock
- Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. Three callsites: mood-reader for ticket sentiment, handoff for the per-name plain reason, and summary for the monthly narrative. If a heavier monthly analysis is ever wanted (cohort patterns across reasons), summary can be promoted to anthropic.claude-sonnet-4-6-20250930-v1:0 via its Global profile, but Haiku is enough for the current paragraph.
- Embeddings. Not used. The list is structured rows and the score is plain arithmetic; deterministic math beats vector retrieval here. No Knowledge Base, no S3 Vectors.
- Quotas. Default account quotas are more than enough at SMB volume. The scorer itself doesn’t call Bedrock; the mood and reason calls are small and bursty around ticket arrival and the Monday run.
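To make the "plain arithmetic" point concrete, here is a minimal sketch of how the scorer might turn signals into points and a band without any model call. The signal names, weights, and band cut-offs are hypothetical placeholders (the real ones live in the rules doc), and the prior-state lookup in cp-state is left out.

```python
from __future__ import annotations

# Hypothetical weights: maximum points per signal, summing to 100.
WEIGHTS = {"order_gap": 50, "support_mood": 30, "usage_drop": 20}
# Hypothetical cut-offs, checked from the highest floor down.
BANDS = [(75, "churning"), (50, "at-risk"), (25, "watch"), (0, "steady")]


def score_customer(signals: dict[str, float]) -> tuple[int, str, dict[str, int]]:
    """`signals` maps each signal name to a risk level in [0, 1]; missing signals count as 0."""
    points: dict[str, int] = {}
    for name, weight in WEIGHTS.items():
        level = min(max(signals.get(name, 0.0), 0.0), 1.0)  # clamp to [0, 1]
        points[name] = round(level * weight)
    total = sum(points.values())
    band = next(label for floor, label in BANDS if total >= floor)
    return total, band, points
```

The per-signal points dictionary is the breakdown that handoff later asks Haiku to phrase as a one-line reason, so the model only rewords numbers that were computed deterministically.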
EventBridge Scheduler config
- cp-weekly-run: cron(0 8 ? * 2 *) (Mondays at 8am) in the SMB’s timezone. Target: scorer Lambda.
- cp-drive-sync: rate(15 minutes). Target: drive-sync Lambda.
- cp-weekly-digest: cron(0 16 ? * 6 *) (Fridays 4pm) in TZ. Target: digest Lambda.
- cp-monthly-summary: cron(0 9 ? * 2#1 *) (first Monday at 9am) in TZ. Target: summary Lambda.
- Order import: the order-import Lambda is S3-PUT-driven on cp-order-feed, not Scheduler-driven, so it runs whenever the export lands. If the store can’t push on a schedule, a rate(1 day) Scheduler rule can pull instead.
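The schedules above can be sketched as keyword arguments in the shape that boto3’s EventBridge Scheduler create_schedule call accepts. The account ID and the invoke-role name are placeholders I made up for illustration. The timezone goes in ScheduleExpressionTimezone; when it is omitted the expression is evaluated in UTC, which is the drift the Deploy section warns about.

```python
from __future__ import annotations

REGION = "ap-southeast-1"
ACCOUNT_ID = "123456789012"  # placeholder account ID
INVOKE_ROLE = f"arn:aws:iam::{ACCOUNT_ID}:role/cp-scheduler-invoke"  # hypothetical role name


def lambda_arn(function_name: str) -> str:
    return f"arn:aws:lambda:{REGION}:{ACCOUNT_ID}:function:{function_name}"


def schedule_kwargs(name: str, expression: str, function_name: str,
                    tz: str | None = None) -> dict:
    """Arguments for boto3.client("scheduler").create_schedule(**kwargs)."""
    kwargs = {
        "Name": name,
        "ScheduleExpression": expression,
        "FlexibleTimeWindow": {"Mode": "OFF"},  # fire at the stated time, no jitter window
        "Target": {"Arn": lambda_arn(function_name), "RoleArn": INVOKE_ROLE},
    }
    if tz is not None:
        kwargs["ScheduleExpressionTimezone"] = tz
    return kwargs


TZ = "Asia/Singapore"
SCHEDULES = [
    schedule_kwargs("cp-weekly-run", "cron(0 8 ? * 2 *)", "scorer", TZ),
    schedule_kwargs("cp-drive-sync", "rate(15 minutes)", "drive-sync"),
    schedule_kwargs("cp-weekly-digest", "cron(0 16 ? * 6 *)", "digest", TZ),
    schedule_kwargs("cp-monthly-summary", "cron(0 9 ? * 2#1 *)", "summary", TZ),
]
```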
SES inbound and outbound
- Set the MX record on a dedicated subdomain (e.g. support-signals.your-company.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
- SES inbound rule set cp-inbound-rules: one rule with recipient support-signals@your-company.com → spam scan → S3 PUT to s3://cp-raw-mime/<message-id> → stop. The S3 PUT triggers mood-reader.
- SES outbound for the email-fallback lists and the monthly summary: verify a sender identity at churn@your-company.com with DKIM and SPF on the parent domain. Out of sandbox by request.
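Once SES has written the raw message to cp-raw-mime, the parsing step in mood-reader could look something like the sketch below, using only the standard library. It assumes the customer’s address is in the From header. For tickets forwarded by a helpdesk tool the customer may instead appear in a Reply-To header or in the body, so treat the sender extraction as a placeholder.

```python
from __future__ import annotations

from email import message_from_bytes, policy
from email.utils import parseaddr


def parse_ticket(raw_mime: bytes) -> tuple[str, str]:
    """Return (sender_email, plain_text_body) from a raw MIME message."""
    msg = message_from_bytes(raw_mime, policy=policy.default)
    _, sender = parseaddr(str(msg.get("From", "")))
    part = msg.get_body(preferencelist=("plain",))  # prefer the text/plain part
    body = part.get_content().strip() if part is not None else ""
    return sender.lower(), body
```

The returned body is what would be passed to the constrained Haiku prompt; the sender is what gets matched against the customer sheet.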
IAM (least privilege per Lambda)
Each Lambda has its own role with policies scoped to exact ARNs. Sketch:
- scorer role: s3:GetObject on the list, rules, and voice keys; dynamodb:Query + GetItem on cp-state; events:PutEvents on the default bus. No bedrock:*.
- handoff role: s3:GetObject on the voice doc; bedrock:InvokeModel on the Haiku ARN; secretsmanager:GetSecretValue on the Slack bot token; ses:SendRawEmail from the verified sender; dynamodb:PutItem + Query on cp-state; outbound network access to slack.com.
- outcome-handler role: dynamodb:PutItem + UpdateItem on cp-state and cp-audit; secretsmanager:GetSecretValue on the Slack signing secret; dynamodb:Query for snapshot reads on undo.
- mood-reader role: s3:GetObject on cp-raw-mime; bedrock:InvokeModel on the Haiku ARN; secretsmanager:GetSecretValue on the Sheets-API service-account secret; outbound network to sheets.googleapis.com.
- order-import and drive-sync roles: secretsmanager:GetSecretValue on the Google service-account secret; s3:GetObject/PutObject on the relevant buckets; outbound network to www.googleapis.com. No bedrock:*.
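As an illustration of the sketch above, the scorer role’s policy document might be built like this. The account ID is a placeholder, and the wildcard on cp-rules-source stands in for the exact rules and voice object keys, which the post does not name. In a real deployment those would be listed explicitly.

```python
import json

REGION = "ap-southeast-1"
ACCOUNT_ID = "123456789012"  # placeholder account ID


def scorer_policy() -> dict:
    """IAM policy document for the scorer role: read sources, read state, emit events."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadSources",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": [
                    "arn:aws:s3:::cp-list-source/customers.csv",
                    "arn:aws:s3:::cp-rules-source/*",  # assumed; replace with exact keys
                ],
            },
            {
                "Sid": "ReadState",
                "Effect": "Allow",
                "Action": ["dynamodb:Query", "dynamodb:GetItem"],
                "Resource": f"arn:aws:dynamodb:{REGION}:{ACCOUNT_ID}:table/cp-state",
            },
            {
                "Sid": "EmitWeeklyList",
                "Effect": "Allow",
                "Action": "events:PutEvents",
                "Resource": f"arn:aws:events:{REGION}:{ACCOUNT_ID}:event-bus/default",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(scorer_policy(), indent=2))
```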
Slack interactive flow
The weekly list is posted via the Slack chat.postMessage Web API with Block Kit blocks containing one row per customer and three action buttons each. Button clicks are sent by Slack to the configured Interactivity request URL, which is the outcome-handler Function URL. outcome-handler verifies the Slack signing secret on the inbound request, parses the action_id (reached_out, won_back, lost), opens a modal if needed (Lost opens a small reason picker; Reached-out and Won-back are one-tap, Won-back optionally confirming the recovered value), and processes the response when the modal is submitted.
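A minimal sketch of the Block Kit payload for one customer row follows. The action_id values are the ones named above; the block_id format, the button labels, and the choice to carry customer_id in each button’s value are assumptions made for illustration.

```python
from __future__ import annotations


def button(label: str, action_id: str, customer_id: str) -> dict:
    """One Block Kit button; `value` carries the customer back to outcome-handler."""
    return {
        "type": "button",
        "text": {"type": "plain_text", "text": label},
        "action_id": action_id,
        "value": customer_id,
    }


def customer_blocks(customer_id: str, name: str, score: int, reason: str) -> list[dict]:
    """A section with the name, score, and plain reason, then the three outcome buttons."""
    return [
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*{name}* (score {score}): {reason}"},
        },
        {
            "type": "actions",
            "block_id": f"outcome:{customer_id}",  # hypothetical naming scheme
            "elements": [
                button("Reached out", "reached_out", customer_id),
                button("Won back", "won_back", customer_id),
                button("Lost", "lost", customer_id),
            ],
        },
    ]
```

The full message would concatenate customer_blocks(...) for every capped candidate and pass the result as the blocks argument of chat.postMessage.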
The Slack app needs chat:write, im:write, and the Interactivity URL configured. The bot token lives in Secrets Manager under cp/slack/bot-token. The signing secret is cp/slack/signing-secret.
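Because the Function URL is public, the signature check is the only gate on outcome-handler. Below is a sketch of Slack’s v0 signing scheme as I understand it: an HMAC-SHA256 over v0:timestamp:body keyed with the signing secret, compared against the X-Slack-Signature header, with stale timestamps rejected. The function name and the five-minute skew are my choices.

```python
from __future__ import annotations

import hashlib
import hmac
import time


def verify_slack_signature(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, now: float | None = None,
                           max_skew: int = 300) -> bool:
    """True only if the signature matches and the timestamp is within max_skew seconds."""
    try:
        ts = int(timestamp)
    except (TypeError, ValueError):
        return False
    current = time.time() if now is None else now
    if abs(current - ts) > max_skew:
        return False  # likely a replayed or badly delayed request
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII type errors.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The body must be the exact raw bytes Slack sent. If the Function URL event arrives with isBase64Encoded set, decode it before verifying rather than re-serialising a parsed payload.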
Observability and cost gates
- CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Metric filter on "error" + "throttle" + "timeout" to a CloudWatch metric for alerting.
- Alarms: scorer Lambda failures > 0 in a week (the weekly run is the one piece that has to fire); handoff failure rate > 1% in 24h; outcome-handler signature-verification failures > 5/hour (might mean the Slack secret rotated).
- X-Ray: off by default. Not worth the cost at SMB volume.
- AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic cp-cost-alarm subscribed to the on-call admin’s email and Slack.
Config and secrets
Service-account credentials for Drive and Sheets APIs live in Secrets Manager under cp/drive/sa (one service account with scopes for both APIs). Slack bot token and signing secret under cp/slack/*. SES sender identity lives in IAM and the verified-domain config. The configured timezone, the signal weights and band cut-offs (mirrored from the rules doc for fast reads), the weekly cap, the contact pause window, and the admin fallback owner all live in Parameter Store under /cp/config/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.
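The "fetch on cold start, cache for the lifetime of the execution environment" pattern can be sketched with a module-level cache, since module state survives across warm invocations. The loader is injectable so the caching logic can be exercised without AWS. The parameter names under /cp/config/ are not specified in the post, so any key used with this sketch is hypothetical.

```python
from __future__ import annotations

from typing import Callable

CONFIG_PATH = "/cp/config/"
_cache: dict[str, str] | None = None  # lives as long as the execution environment


def load_from_ssm(path: str = CONFIG_PATH) -> dict[str, str]:
    """Read every parameter under `path`, keyed by the name with the prefix stripped."""
    import boto3  # imported lazily so the module loads without AWS dependencies

    ssm = boto3.client("ssm")
    params: dict[str, str] = {}
    paginator = ssm.get_paginator("get_parameters_by_path")
    for page in paginator.paginate(Path=path, Recursive=True, WithDecryption=True):
        for param in page["Parameters"]:
            params[param["Name"][len(path):]] = param["Value"]
    return params


def get_config(loader: Callable[[], dict[str, str]] = load_from_ssm) -> dict[str, str]:
    """Load once per cold start; later calls in the same environment reuse the cache."""
    global _cache
    if _cache is None:
        _cache = loader()
    return _cache
```

One consequence of this design is that a Parameter Store change only takes effect when an execution environment is recycled, so a weight change made just before the Monday run may not be picked up by a warm scorer.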
Deploy
Whichever IaC you prefer. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for both cp-list-source and cp-rules-source so a bad Drive edit can be rolled back in one click, and version the EventBridge Scheduler timezone setting so you don’t accidentally start running the weekly run in UTC after a CI rotation. Deploy with GitHub Actions using OIDC (no long-lived keys) and AWS SAM; a CDK Python stack also fits. Total deployable surface: around eight Lambdas, two DDB tables, four S3 buckets, one EventBridge rule on the default bus (plus the Scheduler rules), one SES rule set, and one Budgets alarm.
That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.