Part 7 of 7 · Waitlist manager series ~8 min read

Engineering reference: the waitlist manager architecture

Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, EventBridge Scheduler config, the DynamoDB schemas, and the conditional-write claim that guarantees no double-booking. Read alongside the previous six posts; this one’s the build sheet.

Region and account shape

Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock cross-Region inference, SNS SMS, and EventBridge Scheduler are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup work at SMB volume — the failure mode for an SMB is one freed slot going unfilled, not a regional outage. One AWS account dedicated to the waitlist manager (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system. SMS in some regions requires registering a sender ID or a 10DLC number; budget a day for that with your carrier.

Topology

AWS topology of the waitlist manager A topology diagram with three regions stacked vertically inside one AWS account boundary. Top region: ingress. Three boxes show the three intake lanes — a Drive sheet sync via the drive-sync Lambda triggered every few minutes by EventBridge Scheduler that mirrors the waitlist CSV to s3://wl-waitlist-source/, an SES inbound rule set with action S3 PUT to s3://wl-raw-mime/ plus the parser Lambda intake-ses-parser that runs Textract on attachments and Bedrock Haiku 4.5 to propose a row for approval, and a web-form Function URL Lambda that validates fields and writes a clean row directly. Middle region: event processing. The offer engine Lambda is triggered by a wl.slot_freed event on the EventBridge default bus (from a booking-tool webhook, a no-show tap, or a new opening); it reads s3://wl-waitlist-source/waitlist.csv, filters by eligibility, sorts by the order in s3://wl-rules-source/rules.txt, reads offer state from DynamoDB wl-offers, and emits one of three events per freed slot that needs an action: wl.make_offer, wl.roll_on, or wl.hand_back. Bottom region: send and claim. The sender Lambda is triggered by an EventBridge rule on the offer events; it resolves the channel, checks quiet hours, writes a live offer with window_ends_at to wl-offers, arms an EventBridge Scheduler one-off roll-on timer, fetches the template from s3://wl-rules-source/voice.txt, and sends via SNS Publish for a text or SES outbound for an email. The claim link is a Function URL Lambda claim-handler that runs one conditional write against wl-slots to flip a slot from open to booked for exactly one claimant, cancels the timer, and updates the Drive sheet via the Sheets API; declines and time-outs route back as roll-on events. CloudWatch Logs collects from every Lambda at 7-day retention. Across the right edge: a small box labelled AWS Budgets alarm at $15 monthly threshold, posting to SNS topic wl-cost-alarm. A note at the bottom: one conditional write guarantees no double-booking — and every offer, claim, and roll is logged to wl-audit. Ingress Lambda · drive-sync every few min Sheets API → s3://wl-waitlist-source/ waitlist.csv SES inbound rule set wl-inbound-rules action: S3 PUT s3://wl-raw-mime/ trigger: intake-ses-parser Lambda · web-form Function URL validates fields, drops spam → row to sheet Drive waitlist sheet canonical store · mirrored to S3 Event processing EventBridge bus wl.slot_freed from webhook, no-show tap, or new opening Lambda · offer-engine reads CSV from S3 + rules.txt + voice.txt filter, sort, picks one of four moves EventBridge default bus wl.make_offer wl.roll_on wl.hand_back (no fit → no event) Send & claim Lambda · sender resolves channel, quiet hours; arms window timer; SNS text or SES email Offer on phone link with [Claim] [Decline] + countdown link clicks → Function URL Lambda · claim-handler one conditional write open → booked; logs wl-audit, and updates the Sheet via Sheets API One conditional write guarantees no double-booking — and every step is logged to wl-audit.
Fig 7. AWS topology, in three regions of the diagram: ingress (three lanes into the waitlist), event processing (the offer engine reacting to a freed slot and emitting move events), send and claim (the offer ships and the claim is resolved by one conditional write). Every Lambda is event- or schedule-driven; nothing is synchronous-chained.

Lambda functions

All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.

  • drive-sync — EventBridge Scheduler target, fires every few minutes (default rate(3 minutes)). Uses the Google Drive API + Sheets API (service-account credentials in Secrets Manager under wl/drive/sa) to export the waitlist sheet as CSV and write to s3://wl-waitlist-source/waitlist.csv only if the sheet has changed. The same pattern syncs the rules and voice docs to s3://wl-rules-source/. Memory: 256 MB. Timeout: 30 s.
  • web-form — Lambda Function URL, public with AuthType: NONE. Backs the hosted join-the-waitlist page. Validates fields, runs a lightweight spam/honeypot check and a per-IP rate limit (token bucket in wl-ratelimit with a short TTL), and writes a clean row to the Drive sheet via the Sheets API. Memory: 256 MB. Timeout: 15 s.
  • intake-ses-parser — S3 PUT trigger on s3://wl-raw-mime/. Parses MIME; if there’s a PDF or image attachment, runs Textract via StartDocumentTextDetection (async via SNS completion); otherwise reads the body text. Calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0 via global.anthropic.claude-haiku-4-5-20251001-v1:0) to propose a waitlist row, and posts an Approve/Edit/Discard card to the staff channel. For DOCX attachments (Textract doesn’t accept them), falls back to python-docx. Memory: 512 MB. Timeout: 60 s.
  • offer-engine — EventBridge rule on wl.slot_freed (and on the internal wl.roll_on re-entry). Reads s3://wl-waitlist-source/waitlist.csv and the rules and voice docs. Filters by eligibility (service, party size, date window, staff preference), sorts by the rules-doc order (join time, lifted by priority), reads wl-offers for live/tried state, decides on a move. Emits one event per slot that needs action: wl.make_offer, wl.roll_on, or wl.hand_back. No Bedrock calls. Memory: 512 MB. Timeout: 60 s.
  • sender — EventBridge rule on the offer events. Resolves channel (mobile → SNS, else email → SES), checks quiet hours, writes the live offer to wl-offers with window_ends_at, mints a claim token (HMAC over slot_id|offer_seq|exp with a secret in wl/claim/signing-key), arms a one-off EventBridge Scheduler roll-on timer, formats the message from the voice template, and ships via SNS Publish or SES SendRawEmail. On quiet-hours defer, creates a Scheduler one-off that re-invokes sender at the next business minute (window armed only on actual send). Memory: 256 MB. Timeout: 30 s.
  • claim-handler — Lambda Function URL, public with AuthType: NONE; verifies the HMAC claim token and its expiry. On claim, runs one DynamoDB UpdateItem on wl-slots with a condition expression status = :open AND live_offer_seq = :seq, setting status = booked, booked_by = entry_id; on success, confirms to the customer, deletes the roll-on Scheduler one-off, and updates the Drive sheet via the Sheets API. On ConditionalCheckFailedException, returns the “just filled” page. On decline, marks the offer declined in wl-offers and puts a wl.roll_on event. Writes wl-audit on every path. Memory: 256 MB. Timeout: 15 s.
  • roll-on-timer — target of each one-off claim-window Scheduler rule. Re-reads wl-slots; if the slot is still open, marks the live offer timed_out in wl-offers and puts a wl.roll_on event; if booked, no-op. Idempotent on (slot_id, offer_seq). Memory: 256 MB. Timeout: 15 s.
  • housekeeping — EventBridge Scheduler target, daily. Expires stale entries past their latest date, removes booked customers from the active list, and reconciles any slot left in an in-between state (e.g. a Scheduler rule that failed to fire). No Bedrock. Memory: 256 MB.
  • summary — EventBridge Scheduler target, monthly on the first Monday at 9am. Reads the past month’s wl-offers, wl-slots, and wl-audit; calls Bedrock Haiku 4.5 to write a one-paragraph owner narrative (slots freed, slots filled, average time-to-fill, estimated recovered revenue); emails it via SES. Memory: 512 MB.

Storage

  • DynamoDB · wl-slots — one row per freed slot. PK slot_id; attributes: service, slot_datetime, party_capacity, staff, status (open/booked/handed_back), live_offer_seq, booked_by. The conditional-write target. On-demand.
  • DynamoDB · wl-offers — one row per offer attempt. PK (slot_id, offer_seq); attributes: entry_id, channel, sent_at, window_ends_at (epoch), status (live/claimed/declined/timed_out). GSI on entry_id for per-customer history. On-demand.
  • DynamoDB · wl-audit — one row per action of any kind. PK (slot_id, ts); attributes: entry_id, outcome (offered/claimed/declined/timed_out/handed_back), by, notes. No TTL — long-term audit trail. On-demand.
  • DynamoDB · wl-ratelimit — per-IP token bucket for the web form. PK ip; short TTL on each item. On-demand.
  • S3 · wl-waitlist-source — mirrored CSV from the Drive waitlist sheet. Versioning enabled. Lifecycle to Glacier at 90 days; expiry at 3 years.
  • S3 · wl-rules-source — mirrored rules and voice docs as plain text. Versioning enabled.
  • S3 · wl-raw-mime — raw inbound MIME from forwarded requests. Lifecycle to Glacier at 30 days; expiry at 1 year.

The no-double-booking guarantee

The whole safety property rests on a single DynamoDB conditional write. A slot lives in wl-slots with status = open and a live_offer_seq set by the sender. claim-handler issues UpdateItem with ConditionExpression: status = :open AND live_offer_seq = :seq. DynamoDB guarantees that condition is evaluated atomically against the current item, so for any number of concurrent claims only one can satisfy it — the rest get ConditionalCheckFailedException. The token binds a link to one offer_seq, so a stale link from a rolled-on offer fails the live_offer_seq check even before the status check. The roll-on timer re-reads status and is a no-op on a booked slot. No locks, no transactions across tables, no read-then-write race — one write decides it.

Bedrock

  • Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. Two callsites: intake-ses-parser for inbound request parsing, and summary for the monthly owner narrative. Sonnet 4.6 (anthropic.claude-sonnet-4-6-...) is available as a swap on the summary if richer analysis is ever wanted, but Haiku is plenty for a single paragraph.
  • Embeddings. Not used. The waitlist is structured rows; deterministic filter-and-sort beats vector retrieval here. No Knowledge Base, no S3 Vectors.
  • Quotas. Default account quotas are more than enough at SMB volume. The offer path doesn’t call Bedrock; the parsing lane fires a few times a month at most.

EventBridge Scheduler config

  • wl-drive-syncrate(3 minutes). Target: drive-sync Lambda.
  • wl-housekeepingcron(0 3 * * ? *) in TZ. Target: housekeeping Lambda.
  • wl-monthly-summarycron(0 9 ? * 2#1 *) (first Monday at 9am) in TZ. Target: summary Lambda.
  • Claim-window one-offs — created by sender per live offer. Use at(YYYY-MM-DDTHH:MM:SS) at window_ends_at with --action-after-completion DELETE so the rule self-cleans; target roll-on-timer. Deleted early by claim-handler on a successful claim.
  • Quiet-hours defer one-offs — created by sender when an offer is held; at(...) at the next business minute, target sender.

SES, SNS, inbound and outbound

  • Set the MX record on a dedicated subdomain (e.g. waitlist.your-business.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
  • SES inbound rule set wl-inbound-rules: one rule with recipient waitlist@your-business.com → spam scan → S3 PUT to s3://wl-raw-mime/<message-id> → stop. The S3 PUT triggers intake-ses-parser.
  • SES outbound for email-fallback offers and the monthly summary: verify a sender identity at waitlist@your-business.com with DKIM and SPF on the parent domain. Out of sandbox by request.
  • SNS for the offer texts: an origination identity (sender ID or 10DLC long code) registered for the destination country; per-message Publish with an SMS attribute set to Transactional for delivery priority.

IAM (least privilege per Lambda)

Each Lambda has its own role with policies scoped to exact ARNs. Sketch:

  • offer-engine role: s3:GetObject on the waitlist, rules, and voice keys; dynamodb:Query + GetItem on wl-offers, wl-slots; events:PutEvents on the default bus. No bedrock:*.
  • sender role: scheduler:CreateSchedule + DeleteSchedule for the window and defer one-offs; secretsmanager:GetSecretValue on the claim signing key; sns:Publish; ses:SendRawEmail from the verified identity; dynamodb:PutItem on wl-offers.
  • claim-handler role: dynamodb:UpdateItem on wl-slots (the conditional write); dynamodb:PutItem on wl-offers and wl-audit; scheduler:DeleteSchedule for the window one-off; events:PutEvents for roll-on; secretsmanager:GetSecretValue on the claim signing key and the Sheets service-account secret; outbound network to sheets.googleapis.com.
  • intake-ses-parser role: s3:GetObject on wl-raw-mime; textract:StartDocumentTextDetection; bedrock:InvokeModel on the Haiku ARN; outbound network for the staff-channel post.
  • drive-sync and web-form roles: secretsmanager:GetSecretValue on the Google service-account secret; s3:PutObject on the waitlist and rules buckets (drive-sync); dynamodb:UpdateItem on wl-ratelimit (web-form); outbound network to www.googleapis.com.

Freed-slot sources

Three producers put a wl.slot_freed event on the bus. A booking-tool webhook hits a thin Function URL (slot-webhook, signature-verified) on a cancellation. A no-show tap from the front-desk view hits the same Function URL with a no-show action. A calendar-opening sync (optional, if availability lives in Google Calendar) runs on the housekeeping tick and emits a freed-slot event for any newly added gap. All three normalize to the same event shape — service, slot_datetime, party_capacity, staff — so the engine has one input contract.

Observability and cost gates

  • CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on "error" + "throttle" + "timeout" to a metric for alerting.
  • Alarms: claim-handler 5xx > 0 (a customer can’t claim); roll-on-timer failures > 0 (a slot could stall); SNS SMS delivery-failure rate > 5% (carrier or origination-id issue); offer-engine errors > 0.
  • SQS + DLQ: the EventBridge targets use an SQS dead-letter queue so a failed roll-on or send can be replayed instead of lost.
  • X-Ray: off by default. Not worth the cost at SMB volume.
  • AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic wl-cost-alarm subscribed to the on-call admin’s email.

Config and secrets

Service-account credentials for Drive and Sheets live in Secrets Manager under wl/drive/sa. The claim-token signing key is wl/claim/signing-key; the booking-webhook signing secret is wl/webhook/secret. The configured timezone, quiet-hours window, claim-window length, per-slot try cap, and down-fit policy all live in Parameter Store under /wl/config/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment. The rules and voice docs are read fresh from S3 per invocation so staff edits take effect on the next freed slot without a deploy.

Deploy

GitHub Actions with OIDC into a deploy role (no long-lived keys) and AWS SAM. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for both wl-waitlist-source and wl-rules-source so a bad Drive edit rolls back in one click, and pin the EventBridge Scheduler timezone so a CI rotation can’t silently start the housekeeping tick in UTC. Total deployable surface: around nine Lambdas, four DynamoDB tables, three S3 buckets, one EventBridge rule set on the default bus (plus the Scheduler rules), one SES rule set, one SNS origination identity, one SQS DLQ, and one Budgets alarm.

That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.

All posts