Engineering reference: the churn predictor architecture

Region and account shape

Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock cross-Region inference, and EventBridge Scheduler are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup work at SMB volume — the failure mode for an SMB is one missed weekly list, not a regional outage. One AWS account dedicated to the predictor (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system.

Topology

Fig 7. AWS topology, in three regions of the diagram: ingress (three lanes into the list), scheduled processing (the weekly scorer emitting a per-owner list event), hand-off and outcome (the list ships and the owner’s outcome is recorded). Every Lambda is event- or schedule-driven; nothing is synchronous-chained.

Lambda functions

All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.

drive-sync — EventBridge Scheduler target, fires every 15 minutes. Uses the Google Drive API + Sheets API (service-account credentials in Secrets Manager under cp/drive/sa) to export the customer sheet as CSV and write to s3://cp-list-source/customers.csv only if the sheet has changed since the last sync. Same pattern syncs the rules and voice docs to s3://cp-rules-source/. Memory: 256 MB. Timeout: 30 s.
order-import — S3 PUT trigger on s3://cp-order-feed/ (the store or billing tool drops a daily CSV; a small connector or a scheduled export populates it). Groups rows by customer, derives last_order_date and order_pace (median inter-order gap over a trailing window), and writes them back to the Drive sheet via the Sheets API batchUpdate. Idempotent on re-run of the same file. No model — these are facts. Memory: 256 MB. Timeout: 60 s.
mood-reader — S3 PUT trigger on s3://cp-raw-mime/. Parses the MIME, extracts the ticket body and the customer’s email/identifier, and calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0 via global.anthropic.claude-haiku-4-5-20251001-v1:0) with a constrained prompt that returns one of sour/flat/happy. Maps the label to a number, blends it with the customer’s recent mood (exponential moving average over the trailing few tickets so one bad day doesn’t dominate), and writes support_mood back to the sheet. Strictly read-only with respect to the customer — it never drafts or sends a reply. Memory: 256 MB. Timeout: 30 s.
scorer — EventBridge Scheduler target, weekly Monday at 8am local time (the schedule expression runs in TZ_NAME set to the SMB’s timezone, e.g. Asia/Singapore). Reads s3://cp-list-source/customers.csv and the rules and voice docs. For each row, turns each signal into points using the weights, sums to a total out of 100, reads prior state from cp-state, and assigns a band. Emits one cp.weekly_list event per owner carrying that owner’s at-risk and churning candidates with their scores and per-signal point breakdowns as the event payload. Steady and watch customers emit no list event. Memory: 512 MB. Timeout: 60 s. No Bedrock calls.
handoff — EventBridge rule on the cp.weekly_list event. Resolves owner, applies the cap (rank by score, churning first, keep top N from the rules doc), drops candidates inside the contact pause window read from cp-state, and for each surviving name calls Bedrock Haiku 4.5 to render the point breakdown into a one-line plain reason (grounded strictly in the supplied points). Ships via Slack chat.postMessage with Block Kit buttons (cp/slack/bot-token in Secrets Manager) or SES SendRawEmail for email fallback. Writes the surfaced names to cp-state after a successful send. Memory: 512 MB. Timeout: 60 s.
outcome-handler — Lambda Function URL, public with AuthType: NONE; verifies a Slack signature on the request body. Triggered by Slack interactive button clicks (Reached-out/Won-back/Lost) and by email-link clicks. Writes to cp-state and cp-audit; on won-back, resets the customer’s score and clears the surfaced/contact fields; on lost, records the reason and marks the customer so the scorer stops surfacing them. Memory: 256 MB. Timeout: 15 s.
digest — optional EventBridge Scheduler target, weekly Friday 4pm. Reads cp-state for the watch band and the week’s outcomes; posts a short “watch list and outcomes so far” message to a configured Slack channel. No Bedrock; a plain summary table. Memory: 256 MB.
summary — EventBridge Scheduler target, monthly on the first Monday at 9am. Reads the past month’s cp-state and cp-audit; calls Bedrock Haiku 4.5 to write a one-paragraph owner narrative (flagged, reached, won back with recovered value, lost with top reasons); emails it via SES to the configured stakeholder list. Memory: 512 MB.

Storage

DynamoDB · cp-state — one row per customer. PK customer_id; attributes: score, band, reason, surfaced_date, last_contact, status (active/lost), owner. On-demand. No TTL — it’s the live state the scorer reads each week.
DynamoDB · cp-audit — one row per write action of any kind. PK (customer_id, ts); attributes: action (reached_out/won_back/lost/undo), by_user, before, after, notes (e.g. lost-reason, recovered value). On-demand. No TTL — this is the long-term outcome trail the summary counts from.
S3 · cp-list-source — mirrored CSV from the Drive customer list. Versioning enabled. Lifecycle to Glacier at 90 days; expiry at 7 years.
S3 · cp-rules-source — mirrored rules and voice docs as plain text. Versioning enabled.
S3 · cp-order-feed — daily order exports from the store/billing tool. Lifecycle to Glacier at 30 days; expiry at 1 year (the derived columns live in the sheet, so the raw exports are short-lived).
S3 · cp-raw-mime — raw inbound MIME from forwarded support tickets. Lifecycle to Glacier at 30 days; expiry at 7 years.

Bedrock

Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. Three callsites: mood-reader for ticket sentiment, handoff for the per-name plain reason, and summary for the monthly narrative. If a heavier monthly analysis is ever wanted (cohort patterns across reasons), summary can be promoted to anthropic.claude-sonnet-4-6-20250930-v1:0 via its Global profile — but Haiku is enough for the current paragraph.
Embeddings. Not used. The list is structured rows and the score is plain arithmetic; deterministic math beats vector retrieval here. No Knowledge Base, no S3 Vectors.
Quotas. Default account quotas are more than enough at SMB volume. The scorer itself doesn’t call Bedrock; the mood and reason calls are small and bursty around ticket arrival and the Monday run.

EventBridge Scheduler config

cp-weekly-run — cron(0 8 ? * 2 *) (Mondays at 8am) in the SMB’s timezone. Target: scorer Lambda.
cp-drive-sync — rate(15 minutes). Target: drive-sync Lambda.
cp-weekly-digest — cron(0 16 ? * 6 *) (Fridays 4pm) in TZ. Target: digest Lambda.
cp-monthly-summary — cron(0 9 ? * 2#1 *) (first Monday at 9am) in TZ. Target: summary Lambda.
Order import — the order-import Lambda is S3-PUT-driven on cp-order-feed, not Scheduler-driven, so it runs whenever the export lands. If the store can’t push on a schedule, a rate(1 day) Scheduler rule can pull instead.

SES inbound and outbound

Set the MX record on a dedicated subdomain (e.g. support-signals.your-company.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
SES inbound rule set cp-inbound-rules: one rule with recipient support-signals@your-company.com → spam scan → S3 PUT to s3://cp-raw-mime/<message-id> → stop. The S3 PUT triggers mood-reader.
SES outbound for the email-fallback lists and the monthly summary: verify a sender identity at churn@your-company.com with DKIM and SPF on the parent domain. Out of sandbox by request.

IAM (least privilege per Lambda)

Each Lambda has its own role with policies scoped to exact ARNs. Sketch:

scorer role: s3:GetObject on the list, rules, and voice keys; dynamodb:Query + GetItem on cp-state; events:PutEvents on the default bus. No bedrock:*.
handoff role: s3:GetObject on the voice doc; bedrock:InvokeModel on the Haiku ARN; secretsmanager:GetSecretValue on the Slack bot token; ses:SendRawEmail from the verified sender; dynamodb:PutItem + Query on cp-state; outbound network access to slack.com.
outcome-handler role: dynamodb:PutItem + UpdateItem on cp-state and cp-audit; secretsmanager:GetSecretValue on the Slack signing secret; dynamodb:Query for snapshot reads on undo.
mood-reader role: s3:GetObject on cp-raw-mime; bedrock:InvokeModel on the Haiku ARN; secretsmanager:GetSecretValue on the Sheets-API service-account secret; outbound network to sheets.googleapis.com.
order-import and drive-sync roles: secretsmanager:GetSecretValue on the Google service-account secret; s3:GetObject/PutObject on the relevant buckets; outbound network to www.googleapis.com. No bedrock:*.

Slack interactive flow

The weekly list is posted via the Slack chat.postMessage Web API with Block Kit blocks containing one row per customer and three action buttons each. Button clicks are sent by Slack to the configured Interactivity request URL, which is the outcome-handler Function URL. outcome-handler verifies the Slack signing secret on the inbound request, parses the action_id (reached_out, won_back, lost), opens a modal if needed (Lost opens a small reason picker; Reached-out and Won-back are one-tap, Won-back optionally confirming the recovered value), and processes the response when the modal is submitted.

The Slack app needs chat:write, im:write, and the Interactivity URL configured. The bot token lives in Secrets Manager under cp/slack/bot-token. The signing secret is cp/slack/signing-secret.

Observability and cost gates

CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on "error" + "throttle" + "timeout" to a CloudWatch metric for alerting.
Alarms: scorer Lambda failures > 0 in a week (the weekly run is the one piece that has to fire); handoff failure rate > 1% in 24h; outcome-handler signature-verification failures > 5/hour (might mean the Slack secret rotated).
X-Ray: off by default. Not worth the cost at SMB volume.
AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic cp-cost-alarm subscribed to the on-call admin’s email and Slack.

Config and secrets

Service-account credentials for Drive and Sheets APIs live in Secrets Manager under cp/drive/sa (one service account with scopes for both APIs). Slack bot token and signing secret under cp/slack/*. SES sender identity lives in IAM and the verified-domain config. The configured timezone, the signal weights and band cut-offs (mirrored from the rules doc for fast reads), the weekly cap, the contact pause window, and the admin fallback owner all live in Parameter Store under /cp/config/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.

Deploy

Whichever IaC you prefer. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for both cp-list-source and cp-rules-source so a bad Drive edit can be rolled back in one click, and version the EventBridge Scheduler timezone setting so you don’t accidentally start running the weekly run in UTC after a CI rotation. Deploy with GitHub Actions using OIDC (no long-lived keys) and AWS SAM; a CDK Python stack also fits. Total deployable surface: around eight Lambdas, two DDB tables, four S3 buckets, one EventBridge rule on the default bus (plus the Scheduler rules), one SES rule set, and one Budgets alarm.

That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.

All posts