Engineering reference: the cart recovery architecture
Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SQS and DLQ wiring, EventBridge Scheduler config, the DynamoDB schemas, and the unsubscribe flow. Read alongside the previous six posts; this one’s the build sheet.
Region and account shape
Default region: ap-southeast-1 (Singapore). SES outbound, Bedrock Global cross-Region inference, EventBridge Scheduler, and SQS are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup at SMB volume — the failure mode for a small store is a reminder that lands an hour late, not a regional outage. One AWS account dedicated to the cart recovery (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system.
Topology
Lambda functions
All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.
ingest-webhook— Lambda Function URL, public withAuthType: NONE; verifies a shared-secret HMAC header from the storefront on the raw body. Validates the cart event shape and pushes it to the SQS queuecr-events, then returns 200 fast so the storefront isn’t blocked. Decoupling via SQS means a burst of cart traffic (a flash sale) never overruns the intake and nothing is dropped — failures land in the DLQ. Memory: 256 MB. Timeout: 10 s.intake— SQS trigger oncr-events, batch size up to 10. For each event: upserts the cart row incr-statekeyed bycart_id; on an add/update, (re)schedules the per-cart wake-up via an EventBridge Scheduler one-off at the first wait; on a checkout event, flips status toboughtand deletes the pending schedule; on a saved-link event, sets thesaved_linkflag. Idempotent on(cart_id, event_id)so an SQS redelivery is a no-op. Memory: 256 MB. Timeout: 30 s.nightly-export— EventBridge Scheduler target, once a night. Scanscr-statefor the day’s carts and writes them to a Google Sheet via the Sheets API (service-account credentials in Secrets Manager undercr/drive/sa). The sheet is read-only to the system; it never feeds back. Memory: 256 MB. Timeout: 60 s.waiter— EventBridge Scheduler one-off target, fired per cart at the due time. Reads thecr-staterow andcr-sendshistory, loadss3://cr-rules-source/rules.txt, computestime_since_abandon, and decides on a move:still_shopping(reschedule for the next wait),first_reminder/second_reminder(emit the matching event), orgive_up(markclosed, emit nothing). Emits to the EventBridge default bus. Memory: 256 MB. Timeout: 30 s. No Bedrock calls.sender— EventBridge rule on the two reminder events. Resolves the address, checks quiet hours and the do-not-disturb list, formats the email from the voice template, and makes one Bedrock Haiku 4.5 call to polish the opening line (with a deterministic fallback on timeout or error). Ships via SESSendRawEmailwith aList-Unsubscribe+List-Unsubscribe-Postheader for one-click opt-out. On a quiet-hours defer, creates a one-off Scheduler rule that re-invokessenderat the next sending minute. Writes a row tocr-sendsafter a successful send. Memory: 512 MB. Timeout: 30 s.unsub-handler— Lambda Function URL, public withAuthType: NONE; serves the unsubscribe link from the email and theList-Unsubscribe-Postone-click. Adds the email to the unsubscribe list (acr-statepartition or a smallcr-unsubtable) and writescr-audit. The owner’s suppress/unsubscribe/write-off actions from the export sheet hit the same handler with a signed admin token. Memory: 256 MB. Timeout: 15 s.summary— EventBridge Scheduler target, monthly on the first Monday at 9am. Reads the past month’scr-sends,cr-state, andcr-audit; calls Bedrock Haiku 4.5 to write a one-paragraph recovery narrative (carts seen, reminders sent, recovered, dollars won back, written off); emails it via SES to the configured stakeholder list. Memory: 512 MB.
Storage
- DynamoDB ·
cr-state— one row per cart, the canonical list. PKcart_id; attributes:email,items,total,abandoned_at,status(open/bought/closed),saved_link,wait_override,schedule_name. On-demand. A GSI onemailsupports the do-not-disturb and unsubscribe lookups. - DynamoDB ·
cr-sends— one row per reminder sent. PK(cart_id, step); attributes:sent_at,channel,move(first_reminder/second_reminder),recipient. On-demand. A GSI onrecipientbacks the “recent send to this email” do-not-disturb check. - DynamoDB ·
cr-audit— one row per write action of any kind, including the automatic stop on checkout. PK(cart_id, ts); attributes:action,by_user(orsystem),before,after. On-demand. No TTL — this is the long-term audit trail. - DynamoDB ·
cr-unsub— the unsubscribe list. PKemail; attributes:unsubscribed_at,source(shopper/owner). On-demand. No TTL. - S3 ·
cr-rules-source— the rules and voice docs mirrored from Drive as plain text. Versioning enabled, so a bad edit rolls back in one click. - SQS ·
cr-events— the cart-event buffer between the webhook and the intake. Standard queue; visibility timeout sized to the intake timeout; redrive tocr-events-dlqafter 5 attempts. - SQS ·
cr-events-dlq— dead-letter queue for events the intake couldn’t process. A CloudWatch alarm on queue depth > 0 pages the admin.
Bedrock
- Foundation model.
anthropic.claude-haiku-4-5-20251001-v1:0via the Global cross-Region inference profileglobal.anthropic.claude-haiku-4-5-20251001-v1:0. Two callsites:senderfor the one-line reminder polish (with a deterministic fallback), andsummaryfor the monthly recovery narrative. Heavier reasoning isn’t needed anywhere, so Sonnet 4.6 isn’t wired in — Haiku 4.5 covers both paths. - Embeddings. Not used. Carts are structured rows; deterministic lookup beats vector retrieval here. No Knowledge Base, no S3 Vectors.
- Quotas. Default account quotas are more than enough at SMB volume. The timing decision doesn’t call Bedrock; the polish fires only on a reminder that actually sends.
EventBridge Scheduler config
- Per-cart wake-ups — created on the fly by
intakewithat(YYYY-MM-DDTHH:MM:SS)expressions inTZ_NAME, targetwaiter, with--action-after-completion DELETEso each rule self-cleans.waiterreschedules for the second wait when it sends the first reminder. cr-nightly-export—cron(0 2 * * ? *)in TZ. Target:nightly-exportLambda.cr-monthly-summary—cron(0 9 ? * 2#1 *)(first Monday at 9am) in TZ. Target:summaryLambda.- Quiet-hours defers — created on the fly by
senderwhen a send falls in the quiet window. Useat(...)with--action-after-completion DELETE.
SES outbound and the webhook
- Verify a sender identity at
shop@your-store.comwith DKIM and SPF on the parent domain; out of sandbox by request. A customMAIL FROMsubdomain keeps alignment clean for deliverability. - SES configuration set
cr-sends-config: event destination to CloudWatch for bounces and complaints; a complaint rate over threshold auto-adds the address tocr-unsub. - The storefront webhook posts to the
ingest-webhookFunction URL. The shared secret used for the HMAC header lives in Secrets Manager undercr/webhook/secret; rotate it without redeploying the storefront by accepting two valid secrets during a rotation window.
IAM (least privilege per Lambda)
Each Lambda has its own role with policies scoped to exact ARNs. Sketch:
- ingest-webhook role:
sqs:SendMessageoncr-events;secretsmanager:GetSecretValueon the webhook secret. Nothing else. - intake role:
sqs:ReceiveMessage+DeleteMessageoncr-events;dynamodb:PutItem+UpdateItemoncr-state;scheduler:CreateSchedule+DeleteSchedulefor the per-cart wake-ups;iam:PassRoleon the Scheduler target role. - waiter role:
s3:GetObjecton the rules and voice keys;dynamodb:GetItem+Queryoncr-state,cr-sends;scheduler:CreateSchedulefor the second-wait reschedule;events:PutEventson the default bus. Nobedrock:*. - sender role:
scheduler:CreateSchedulefor quiet-hours defers;secretsmanager:GetSecretValueon no secret beyond config;bedrock:InvokeModelon the Haiku ARN;ses:SendRawEmailfrom the verified sender;dynamodb:PutItemoncr-sends;dynamodb:Queryoncr-unsuband thecr-sendsrecipient GSI. - unsub-handler role:
dynamodb:PutItemoncr-unsubandcr-audit;dynamodb:UpdateItemoncr-statefor suppress/write-off;scheduler:DeleteScheduleto cancel a wake-up on suppress. - nightly-export and summary roles:
dynamodb:Scan/Queryon the relevant tables;secretsmanager:GetSecretValueon the Google service-account secret;ses:SendRawEmail(summary only);bedrock:InvokeModelon the Haiku ARN (summary only); outbound network towww.googleapis.com.
Unsubscribe and stop flow
There is exactly one unsubscribe list (cr-unsub) and one way onto it: the unsub-handler. The shopper’s one-click link, the RFC 8058 List-Unsubscribe-Post request their mail client sends, and the owner’s Unsubscribe action from the export sheet all hit the same handler. Suppress and write-off go through the same handler too, but they target cr-state (one cart) rather than cr-unsub (an email). The automatic stop on checkout is handled in intake — status flip plus schedule delete — and double-guarded by the bought-or-unsubscribed check in waiter, so a racing wake-up never sends to someone who just bought.
Observability and cost gates
- CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on
"error"+"throttle"+"timeout"to a CloudWatch metric for alerting. - Alarms:
cr-events-dlqdepth > 0 (a cart event failed to process); sender failure rate > 1% in 24h; SES complaint rate over the safe threshold; waiter errors > 0 in an hour. - X-Ray: off by default. Not worth the cost at SMB volume.
- AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic
cr-cost-alarmsubscribed to the on-call admin’s email.
Config and secrets
Service-account credentials for the Drive and Sheets APIs live in Secrets Manager under cr/drive/sa. The storefront webhook secret is cr/webhook/secret. The configured timezone, quiet-hours window, do-not-disturb window, and the size-to-wait map all live in Parameter Store under /cr/config/, with the human-editable copy mirrored from the Drive rules doc to s3://cr-rules-source/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.
Deploy
GitHub Actions with OIDC into a deploy role — no long-lived keys — running AWS SAM. The opinionated bits: turn on S3 versioning for cr-rules-source so a bad Drive edit rolls back in one click, give the SQS queue a generous redrive policy so a transient intake error retries before the DLQ, and version the EventBridge Scheduler timezone setting so you don’t accidentally start firing wake-ups in UTC after a CI rotation. Total deployable surface: around seven Lambdas, four DDB tables, one S3 bucket, two SQS queues, one EventBridge rule on the default bus (plus the per-cart Scheduler one-offs), one SES configuration set, and one Budgets alarm.
That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your store, see Work with me.
All posts