Part 7 of 7 · Churn predictor series ~8 min read

Engineering reference: the churn predictor architecture

Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, EventBridge Scheduler config, the DynamoDB schemas, and the Slack interactive flow. Read alongside the previous six posts; this one’s the build sheet.

Region and account shape

Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock cross-Region inference, and EventBridge Scheduler are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup work at SMB volume — the failure mode for an SMB is one missed weekly list, not a regional outage. One AWS account dedicated to the predictor (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system.

Topology

AWS topology of the churn predictor A topology diagram with three regions stacked vertically inside one AWS account boundary. Top region: ingress. Three boxes show the three intake lanes — a Drive sheet sync via the drive-sync Lambda triggered every 15 minutes by EventBridge Scheduler that mirrors the customer CSV to s3://cp-list-source/, an order-feed import where a daily export lands in s3://cp-order-feed/ and the order-import Lambda updates last-order and pace columns via the Sheets API, and a support-inbox lane where an SES inbound rule set writes raw MIME to s3://cp-raw-mime/ and the mood-reader Lambda calls Bedrock Haiku 4.5 to rate each ticket's mood and write a number back. Middle region: scheduled processing. The scorer Lambda is triggered weekly Monday at 8am local by EventBridge Scheduler; it reads s3://cp-list-source/customers.csv, iterates rows, sums signal points per customer using the weights in s3://cp-rules-source/rules.txt, reads prior state from DynamoDB cp-state, and emits one event to the EventBridge default bus per owner with that owner's at-risk and churning candidates: cp.weekly_list. Bottom region: hand-off and outcome. The handoff Lambda is triggered by an EventBridge rule on that event; it resolves the owner, applies the cap, skips recent contacts, calls Bedrock Haiku 4.5 to write a plain reason per name from the voice template in s3://cp-rules-source/voice.txt, posts the list to Slack via chat.postMessage with Reached out, Won back, and Lost buttons or sends an email via SES outbound, and writes the surfaced names to DynamoDB cp-state. Slack interactive button clicks land on a Function URL Lambda outcome-handler that updates cp-state with the action and writes to cp-audit. CloudWatch Logs collects from every Lambda at 7-day retention. Across the right edge: a small box labelled AWS Budgets alarm at $15 monthly threshold, posting to SNS topic cp-cost-alarm. A note at the bottom: the system only flags and explains — every interaction is logged to cp-audit, and nothing is sent to a customer. Ingress Lambda · drive-sync every 15 min Sheets API → s3://cp-list-source/ customers.csv Lambda · order-import daily export s3://cp-order-feed/ updates last order and pace columns SES + Lambda · mood ticket → s3://cp-raw-mime/ Haiku 4.5 rates mood sour / flat / happy → number to sheet Drive customer list canonical store · mirrored to S3 Scheduled processing EventBridge Scheduler cron(0 8 ? * 2 *) in TZ_NAME target: scorer Lambda + daily import trigger Lambda · scorer reads CSV from S3 + rules.txt + voice.txt sums points, picks one of four bands EventBridge default bus cp.weekly_list (per owner, at-risk + churning) (steady → no event) Hand-off & outcome Lambda · handoff resolves owner, cap, skip recent; Slack post or SES outbound Slack interactive DM with [Reached out] [Won back] [Lost] button clicks → Function URL Lambda · outcome-handler writes cp-state, cp-audit; on won back resets the score to steady The system only flags and explains — every interaction is logged to cp-audit, never sent to a customer.
Fig 7. AWS topology, in three regions of the diagram: ingress (three lanes into the list), scheduled processing (the weekly scorer emitting a per-owner list event), hand-off and outcome (the list ships and the owner’s outcome is recorded). Every Lambda is event- or schedule-driven; nothing is synchronous-chained.

Lambda functions

All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.

  • drive-sync — EventBridge Scheduler target, fires every 15 minutes. Uses the Google Drive API + Sheets API (service-account credentials in Secrets Manager under cp/drive/sa) to export the customer sheet as CSV and write to s3://cp-list-source/customers.csv only if the sheet has changed since the last sync. Same pattern syncs the rules and voice docs to s3://cp-rules-source/. Memory: 256 MB. Timeout: 30 s.
  • order-import — S3 PUT trigger on s3://cp-order-feed/ (the store or billing tool drops a daily CSV; a small connector or a scheduled export populates it). Groups rows by customer, derives last_order_date and order_pace (median inter-order gap over a trailing window), and writes them back to the Drive sheet via the Sheets API batchUpdate. Idempotent on re-run of the same file. No model — these are facts. Memory: 256 MB. Timeout: 60 s.
  • mood-reader — S3 PUT trigger on s3://cp-raw-mime/. Parses the MIME, extracts the ticket body and the customer’s email/identifier, and calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0 via global.anthropic.claude-haiku-4-5-20251001-v1:0) with a constrained prompt that returns one of sour/flat/happy. Maps the label to a number, blends it with the customer’s recent mood (exponential moving average over the trailing few tickets so one bad day doesn’t dominate), and writes support_mood back to the sheet. Strictly read-only with respect to the customer — it never drafts or sends a reply. Memory: 256 MB. Timeout: 30 s.
  • scorer — EventBridge Scheduler target, weekly Monday at 8am local time (the schedule expression runs in TZ_NAME set to the SMB’s timezone, e.g. Asia/Singapore). Reads s3://cp-list-source/customers.csv and the rules and voice docs. For each row, turns each signal into points using the weights, sums to a total out of 100, reads prior state from cp-state, and assigns a band. Emits one cp.weekly_list event per owner carrying that owner’s at-risk and churning candidates with their scores and per-signal point breakdowns as the event payload. Steady and watch customers emit no list event. Memory: 512 MB. Timeout: 60 s. No Bedrock calls.
  • handoff — EventBridge rule on the cp.weekly_list event. Resolves owner, applies the cap (rank by score, churning first, keep top N from the rules doc), drops candidates inside the contact pause window read from cp-state, and for each surviving name calls Bedrock Haiku 4.5 to render the point breakdown into a one-line plain reason (grounded strictly in the supplied points). Ships via Slack chat.postMessage with Block Kit buttons (cp/slack/bot-token in Secrets Manager) or SES SendRawEmail for email fallback. Writes the surfaced names to cp-state after a successful send. Memory: 512 MB. Timeout: 60 s.
  • outcome-handler — Lambda Function URL, public with AuthType: NONE; verifies a Slack signature on the request body. Triggered by Slack interactive button clicks (Reached-out/Won-back/Lost) and by email-link clicks. Writes to cp-state and cp-audit; on won-back, resets the customer’s score and clears the surfaced/contact fields; on lost, records the reason and marks the customer so the scorer stops surfacing them. Memory: 256 MB. Timeout: 15 s.
  • digest — optional EventBridge Scheduler target, weekly Friday 4pm. Reads cp-state for the watch band and the week’s outcomes; posts a short “watch list and outcomes so far” message to a configured Slack channel. No Bedrock; a plain summary table. Memory: 256 MB.
  • summary — EventBridge Scheduler target, monthly on the first Monday at 9am. Reads the past month’s cp-state and cp-audit; calls Bedrock Haiku 4.5 to write a one-paragraph owner narrative (flagged, reached, won back with recovered value, lost with top reasons); emails it via SES to the configured stakeholder list. Memory: 512 MB.

Storage

  • DynamoDB · cp-state — one row per customer. PK customer_id; attributes: score, band, reason, surfaced_date, last_contact, status (active/lost), owner. On-demand. No TTL — it’s the live state the scorer reads each week.
  • DynamoDB · cp-audit — one row per write action of any kind. PK (customer_id, ts); attributes: action (reached_out/won_back/lost/undo), by_user, before, after, notes (e.g. lost-reason, recovered value). On-demand. No TTL — this is the long-term outcome trail the summary counts from.
  • S3 · cp-list-source — mirrored CSV from the Drive customer list. Versioning enabled. Lifecycle to Glacier at 90 days; expiry at 7 years.
  • S3 · cp-rules-source — mirrored rules and voice docs as plain text. Versioning enabled.
  • S3 · cp-order-feed — daily order exports from the store/billing tool. Lifecycle to Glacier at 30 days; expiry at 1 year (the derived columns live in the sheet, so the raw exports are short-lived).
  • S3 · cp-raw-mime — raw inbound MIME from forwarded support tickets. Lifecycle to Glacier at 30 days; expiry at 7 years.

Bedrock

  • Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. Three callsites: mood-reader for ticket sentiment, handoff for the per-name plain reason, and summary for the monthly narrative. If a heavier monthly analysis is ever wanted (cohort patterns across reasons), summary can be promoted to anthropic.claude-sonnet-4-6-20250930-v1:0 via its Global profile — but Haiku is enough for the current paragraph.
  • Embeddings. Not used. The list is structured rows and the score is plain arithmetic; deterministic math beats vector retrieval here. No Knowledge Base, no S3 Vectors.
  • Quotas. Default account quotas are more than enough at SMB volume. The scorer itself doesn’t call Bedrock; the mood and reason calls are small and bursty around ticket arrival and the Monday run.

EventBridge Scheduler config

  • cp-weekly-runcron(0 8 ? * 2 *) (Mondays at 8am) in the SMB’s timezone. Target: scorer Lambda.
  • cp-drive-syncrate(15 minutes). Target: drive-sync Lambda.
  • cp-weekly-digestcron(0 16 ? * 6 *) (Fridays 4pm) in TZ. Target: digest Lambda.
  • cp-monthly-summarycron(0 9 ? * 2#1 *) (first Monday at 9am) in TZ. Target: summary Lambda.
  • Order import — the order-import Lambda is S3-PUT-driven on cp-order-feed, not Scheduler-driven, so it runs whenever the export lands. If the store can’t push on a schedule, a rate(1 day) Scheduler rule can pull instead.

SES inbound and outbound

  • Set the MX record on a dedicated subdomain (e.g. support-signals.your-company.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
  • SES inbound rule set cp-inbound-rules: one rule with recipient support-signals@your-company.com → spam scan → S3 PUT to s3://cp-raw-mime/<message-id> → stop. The S3 PUT triggers mood-reader.
  • SES outbound for the email-fallback lists and the monthly summary: verify a sender identity at churn@your-company.com with DKIM and SPF on the parent domain. Out of sandbox by request.

IAM (least privilege per Lambda)

Each Lambda has its own role with policies scoped to exact ARNs. Sketch:

  • scorer role: s3:GetObject on the list, rules, and voice keys; dynamodb:Query + GetItem on cp-state; events:PutEvents on the default bus. No bedrock:*.
  • handoff role: s3:GetObject on the voice doc; bedrock:InvokeModel on the Haiku ARN; secretsmanager:GetSecretValue on the Slack bot token; ses:SendRawEmail from the verified sender; dynamodb:PutItem + Query on cp-state; outbound network access to slack.com.
  • outcome-handler role: dynamodb:PutItem + UpdateItem on cp-state and cp-audit; secretsmanager:GetSecretValue on the Slack signing secret; dynamodb:Query for snapshot reads on undo.
  • mood-reader role: s3:GetObject on cp-raw-mime; bedrock:InvokeModel on the Haiku ARN; secretsmanager:GetSecretValue on the Sheets-API service-account secret; outbound network to sheets.googleapis.com.
  • order-import and drive-sync roles: secretsmanager:GetSecretValue on the Google service-account secret; s3:GetObject/PutObject on the relevant buckets; outbound network to www.googleapis.com. No bedrock:*.

Slack interactive flow

The weekly list is posted via the Slack chat.postMessage Web API with Block Kit blocks containing one row per customer and three action buttons each. Button clicks are sent by Slack to the configured Interactivity request URL, which is the outcome-handler Function URL. outcome-handler verifies the Slack signing secret on the inbound request, parses the action_id (reached_out, won_back, lost), opens a modal if needed (Lost opens a small reason picker; Reached-out and Won-back are one-tap, Won-back optionally confirming the recovered value), and processes the response when the modal is submitted.

The Slack app needs chat:write, im:write, and the Interactivity URL configured. The bot token lives in Secrets Manager under cp/slack/bot-token. The signing secret is cp/slack/signing-secret.

Observability and cost gates

  • CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on "error" + "throttle" + "timeout" to a CloudWatch metric for alerting.
  • Alarms: scorer Lambda failures > 0 in a week (the weekly run is the one piece that has to fire); handoff failure rate > 1% in 24h; outcome-handler signature-verification failures > 5/hour (might mean the Slack secret rotated).
  • X-Ray: off by default. Not worth the cost at SMB volume.
  • AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic cp-cost-alarm subscribed to the on-call admin’s email and Slack.

Config and secrets

Service-account credentials for Drive and Sheets APIs live in Secrets Manager under cp/drive/sa (one service account with scopes for both APIs). Slack bot token and signing secret under cp/slack/*. SES sender identity lives in IAM and the verified-domain config. The configured timezone, the signal weights and band cut-offs (mirrored from the rules doc for fast reads), the weekly cap, the contact pause window, and the admin fallback owner all live in Parameter Store under /cp/config/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.

Deploy

Whichever IaC you prefer. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for both cp-list-source and cp-rules-source so a bad Drive edit can be rolled back in one click, and version the EventBridge Scheduler timezone setting so you don’t accidentally start running the weekly run in UTC after a CI rotation. Deploy with GitHub Actions using OIDC (no long-lived keys) and AWS SAM; a CDK Python stack also fits. Total deployable surface: around eight Lambdas, two DDB tables, four S3 buckets, one EventBridge rule on the default bus (plus the Scheduler rules), one SES rule set, and one Budgets alarm.

That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.

All posts