Part 7 of 7 · Cart recovery series ~8 min read

Engineering reference: the cart recovery architecture

Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SQS and DLQ wiring, EventBridge Scheduler config, the DynamoDB schemas, and the unsubscribe flow. Read alongside the previous six posts; this one’s the build sheet.

Region and account shape

Default region: ap-southeast-1 (Singapore). SES outbound, Bedrock Global cross-Region inference, EventBridge Scheduler, and SQS are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup at SMB volume — the failure mode for a small store is a reminder that lands an hour late, not a regional outage. One AWS account dedicated to the cart recovery (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system.

Topology

AWS topology of the cart recovery A topology diagram with three regions stacked vertically inside one AWS account boundary. Top region: ingress. Three boxes show the three intake lanes — a storefront webhook on a Lambda Function URL ingest-webhook that validates the event and pushes it to an SQS queue cr-events with a dead-letter queue, an intake Lambda triggered by the SQS queue that writes or updates the cart row in DynamoDB cr-state and schedules the per-cart wake-up, and a nightly export Lambda triggered by EventBridge Scheduler that reads the day's carts and writes a read-only Google Sheet via the Sheets API. Middle region: scheduled processing. The waiter Lambda is triggered per cart by an EventBridge Scheduler one-off at the due time; it reads the cart row from cr-state, reads send history from cr-sends, looks up the wait in s3://cr-rules-source/rules.txt, and emits one of two events to the EventBridge default bus per cart that needs an action: cr.first_reminder or cr.second_reminder. Bottom region: dispatch and unsubscribe. The sender Lambda is triggered by an EventBridge rule on those two event types; it resolves the address, checks quiet hours and the do-not-disturb list, fetches the template from s3://cr-rules-source/voice.txt, calls Bedrock Haiku 4.5 to polish one line with a deterministic fallback, sends the email via SES outbound, and writes a row to DynamoDB cr-sends. The shopper's one-click unsubscribe link lands on a Function URL Lambda unsub-handler that adds the email to the unsubscribe list and writes cr-audit. CloudWatch Logs collects from every Lambda at 7-day retention. Across the right edge: a small box labelled AWS Budgets alarm at $15 monthly threshold, posting to SNS topic cr-cost-alarm. A note at the bottom: a checkout stops the chase automatically — and every interaction is logged to cr-audit. Ingress Lambda · ingest-webhook Function URL validates event → SQS cr-events + DLQ Lambda · intake SQS trigger writes cr-state row schedules wake-up cancels on checkout Lambda · nightly-export Scheduler, nightly reads day's carts Sheets API → Drive read-only sheet DynamoDB cr-state canonical cart list · one row per cart Scheduled processing EventBridge Scheduler one-off per cart at(due_time) target: waiter Lambda self-deletes after run Lambda · waiter reads cr-state row + cr-sends + rules.txt computes time, picks one of four moves EventBridge default bus cr.first_reminder cr.second_reminder (shopping → reschedule) (give up → no event) Dispatch & unsubscribe Lambda · sender resolves address, quiet hours, do-not- disturb; polish line; SES outbound Reminder email items + total + return link + unsubscribe → Function URL Lambda · unsub-handler adds email to the unsubscribe list, writes cr-audit; same path as owner A checkout stops the chase automatically — and every interaction is logged to cr-audit.
Fig 7. AWS topology, in three regions of the diagram: ingress (the webhook through a queue into the cart list), scheduled processing (the per-cart wake-up emitting events), dispatch and unsubscribe (the reminder ships and the opt-out is recorded). Every Lambda is event- or schedule-driven; nothing is synchronous-chained.

Lambda functions

All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.

  • ingest-webhook — Lambda Function URL, public with AuthType: NONE; verifies a shared-secret HMAC header from the storefront on the raw body. Validates the cart event shape and pushes it to the SQS queue cr-events, then returns 200 fast so the storefront isn’t blocked. Decoupling via SQS means a burst of cart traffic (a flash sale) never overruns the intake and nothing is dropped — failures land in the DLQ. Memory: 256 MB. Timeout: 10 s.
  • intake — SQS trigger on cr-events, batch size up to 10. For each event: upserts the cart row in cr-state keyed by cart_id; on an add/update, (re)schedules the per-cart wake-up via an EventBridge Scheduler one-off at the first wait; on a checkout event, flips status to bought and deletes the pending schedule; on a saved-link event, sets the saved_link flag. Idempotent on (cart_id, event_id) so an SQS redelivery is a no-op. Memory: 256 MB. Timeout: 30 s.
  • nightly-export — EventBridge Scheduler target, once a night. Scans cr-state for the day’s carts and writes them to a Google Sheet via the Sheets API (service-account credentials in Secrets Manager under cr/drive/sa). The sheet is read-only to the system; it never feeds back. Memory: 256 MB. Timeout: 60 s.
  • waiter — EventBridge Scheduler one-off target, fired per cart at the due time. Reads the cr-state row and cr-sends history, loads s3://cr-rules-source/rules.txt, computes time_since_abandon, and decides on a move: still_shopping (reschedule for the next wait), first_reminder / second_reminder (emit the matching event), or give_up (mark closed, emit nothing). Emits to the EventBridge default bus. Memory: 256 MB. Timeout: 30 s. No Bedrock calls.
  • sender — EventBridge rule on the two reminder events. Resolves the address, checks quiet hours and the do-not-disturb list, formats the email from the voice template, and makes one Bedrock Haiku 4.5 call to polish the opening line (with a deterministic fallback on timeout or error). Ships via SES SendRawEmail with a List-Unsubscribe + List-Unsubscribe-Post header for one-click opt-out. On a quiet-hours defer, creates a one-off Scheduler rule that re-invokes sender at the next sending minute. Writes a row to cr-sends after a successful send. Memory: 512 MB. Timeout: 30 s.
  • unsub-handler — Lambda Function URL, public with AuthType: NONE; serves the unsubscribe link from the email and the List-Unsubscribe-Post one-click. Adds the email to the unsubscribe list (a cr-state partition or a small cr-unsub table) and writes cr-audit. The owner’s suppress/unsubscribe/write-off actions from the export sheet hit the same handler with a signed admin token. Memory: 256 MB. Timeout: 15 s.
  • summary — EventBridge Scheduler target, monthly on the first Monday at 9am. Reads the past month’s cr-sends, cr-state, and cr-audit; calls Bedrock Haiku 4.5 to write a one-paragraph recovery narrative (carts seen, reminders sent, recovered, dollars won back, written off); emails it via SES to the configured stakeholder list. Memory: 512 MB.

Storage

  • DynamoDB · cr-state — one row per cart, the canonical list. PK cart_id; attributes: email, items, total, abandoned_at, status (open/bought/closed), saved_link, wait_override, schedule_name. On-demand. A GSI on email supports the do-not-disturb and unsubscribe lookups.
  • DynamoDB · cr-sends — one row per reminder sent. PK (cart_id, step); attributes: sent_at, channel, move (first_reminder/second_reminder), recipient. On-demand. A GSI on recipient backs the “recent send to this email” do-not-disturb check.
  • DynamoDB · cr-audit — one row per write action of any kind, including the automatic stop on checkout. PK (cart_id, ts); attributes: action, by_user (or system), before, after. On-demand. No TTL — this is the long-term audit trail.
  • DynamoDB · cr-unsub — the unsubscribe list. PK email; attributes: unsubscribed_at, source (shopper/owner). On-demand. No TTL.
  • S3 · cr-rules-source — the rules and voice docs mirrored from Drive as plain text. Versioning enabled, so a bad edit rolls back in one click.
  • SQS · cr-events — the cart-event buffer between the webhook and the intake. Standard queue; visibility timeout sized to the intake timeout; redrive to cr-events-dlq after 5 attempts.
  • SQS · cr-events-dlq — dead-letter queue for events the intake couldn’t process. A CloudWatch alarm on queue depth > 0 pages the admin.

Bedrock

  • Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. Two callsites: sender for the one-line reminder polish (with a deterministic fallback), and summary for the monthly recovery narrative. Heavier reasoning isn’t needed anywhere, so Sonnet 4.6 isn’t wired in — Haiku 4.5 covers both paths.
  • Embeddings. Not used. Carts are structured rows; deterministic lookup beats vector retrieval here. No Knowledge Base, no S3 Vectors.
  • Quotas. Default account quotas are more than enough at SMB volume. The timing decision doesn’t call Bedrock; the polish fires only on a reminder that actually sends.

EventBridge Scheduler config

  • Per-cart wake-ups — created on the fly by intake with at(YYYY-MM-DDTHH:MM:SS) expressions in TZ_NAME, target waiter, with --action-after-completion DELETE so each rule self-cleans. waiter reschedules for the second wait when it sends the first reminder.
  • cr-nightly-exportcron(0 2 * * ? *) in TZ. Target: nightly-export Lambda.
  • cr-monthly-summarycron(0 9 ? * 2#1 *) (first Monday at 9am) in TZ. Target: summary Lambda.
  • Quiet-hours defers — created on the fly by sender when a send falls in the quiet window. Use at(...) with --action-after-completion DELETE.

SES outbound and the webhook

  • Verify a sender identity at shop@your-store.com with DKIM and SPF on the parent domain; out of sandbox by request. A custom MAIL FROM subdomain keeps alignment clean for deliverability.
  • SES configuration set cr-sends-config: event destination to CloudWatch for bounces and complaints; a complaint rate over threshold auto-adds the address to cr-unsub.
  • The storefront webhook posts to the ingest-webhook Function URL. The shared secret used for the HMAC header lives in Secrets Manager under cr/webhook/secret; rotate it without redeploying the storefront by accepting two valid secrets during a rotation window.

IAM (least privilege per Lambda)

Each Lambda has its own role with policies scoped to exact ARNs. Sketch:

  • ingest-webhook role: sqs:SendMessage on cr-events; secretsmanager:GetSecretValue on the webhook secret. Nothing else.
  • intake role: sqs:ReceiveMessage + DeleteMessage on cr-events; dynamodb:PutItem + UpdateItem on cr-state; scheduler:CreateSchedule + DeleteSchedule for the per-cart wake-ups; iam:PassRole on the Scheduler target role.
  • waiter role: s3:GetObject on the rules and voice keys; dynamodb:GetItem + Query on cr-state, cr-sends; scheduler:CreateSchedule for the second-wait reschedule; events:PutEvents on the default bus. No bedrock:*.
  • sender role: scheduler:CreateSchedule for quiet-hours defers; secretsmanager:GetSecretValue on no secret beyond config; bedrock:InvokeModel on the Haiku ARN; ses:SendRawEmail from the verified sender; dynamodb:PutItem on cr-sends; dynamodb:Query on cr-unsub and the cr-sends recipient GSI.
  • unsub-handler role: dynamodb:PutItem on cr-unsub and cr-audit; dynamodb:UpdateItem on cr-state for suppress/write-off; scheduler:DeleteSchedule to cancel a wake-up on suppress.
  • nightly-export and summary roles: dynamodb:Scan/Query on the relevant tables; secretsmanager:GetSecretValue on the Google service-account secret; ses:SendRawEmail (summary only); bedrock:InvokeModel on the Haiku ARN (summary only); outbound network to www.googleapis.com.

Unsubscribe and stop flow

There is exactly one unsubscribe list (cr-unsub) and one way onto it: the unsub-handler. The shopper’s one-click link, the RFC 8058 List-Unsubscribe-Post request their mail client sends, and the owner’s Unsubscribe action from the export sheet all hit the same handler. Suppress and write-off go through the same handler too, but they target cr-state (one cart) rather than cr-unsub (an email). The automatic stop on checkout is handled in intake — status flip plus schedule delete — and double-guarded by the bought-or-unsubscribed check in waiter, so a racing wake-up never sends to someone who just bought.

Observability and cost gates

  • CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on "error" + "throttle" + "timeout" to a CloudWatch metric for alerting.
  • Alarms: cr-events-dlq depth > 0 (a cart event failed to process); sender failure rate > 1% in 24h; SES complaint rate over the safe threshold; waiter errors > 0 in an hour.
  • X-Ray: off by default. Not worth the cost at SMB volume.
  • AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic cr-cost-alarm subscribed to the on-call admin’s email.

Config and secrets

Service-account credentials for the Drive and Sheets APIs live in Secrets Manager under cr/drive/sa. The storefront webhook secret is cr/webhook/secret. The configured timezone, quiet-hours window, do-not-disturb window, and the size-to-wait map all live in Parameter Store under /cr/config/, with the human-editable copy mirrored from the Drive rules doc to s3://cr-rules-source/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.

Deploy

GitHub Actions with OIDC into a deploy role — no long-lived keys — running AWS SAM. The opinionated bits: turn on S3 versioning for cr-rules-source so a bad Drive edit rolls back in one click, give the SQS queue a generous redrive policy so a transient intake error retries before the DLQ, and version the EventBridge Scheduler timezone setting so you don’t accidentally start firing wake-ups in UTC after a CI rotation. Total deployable surface: around seven Lambdas, four DDB tables, one S3 bucket, two SQS queues, one EventBridge rule on the default bus (plus the per-cart Scheduler one-offs), one SES configuration set, and one Budgets alarm.

That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your store, see Work with me.

All posts