Engineering reference: the waitlist manager architecture
Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, EventBridge Scheduler config, the DynamoDB schemas, and the conditional-write claim that guarantees no double-booking. Read alongside the previous six posts; this one’s the build sheet.
Region and account shape
Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock cross-Region inference, SNS SMS, and EventBridge Scheduler are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup work at SMB volume — the failure mode for an SMB is one freed slot going unfilled, not a regional outage. One AWS account dedicated to the waitlist manager (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system. SMS in some regions requires registering a sender ID or a 10DLC number; budget a day for that with your carrier.
Topology
Lambda functions
All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.
- drive-sync: EventBridge Scheduler target, fires every few minutes (default rate(3 minutes)). Uses the Google Drive API + Sheets API (service-account credentials in Secrets Manager under wl/drive/sa) to export the waitlist sheet as CSV and write to s3://wl-waitlist-source/waitlist.csv only if the sheet has changed. The same pattern syncs the rules and voice docs to s3://wl-rules-source/. Memory: 256 MB. Timeout: 30 s.
- web-form: Lambda Function URL, public with AuthType: NONE. Backs the hosted join-the-waitlist page. Validates fields, runs a lightweight spam/honeypot check and a per-IP rate limit (token bucket in wl-ratelimit with a short TTL), and writes a clean row to the Drive sheet via the Sheets API. Memory: 256 MB. Timeout: 15 s.
- intake-ses-parser: S3 PUT trigger on s3://wl-raw-mime/. Parses MIME; if there’s a PDF or image attachment, runs Textract via StartDocumentTextDetection (async via SNS completion); otherwise reads the body text. Calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0 via global.anthropic.claude-haiku-4-5-20251001-v1:0) to propose a waitlist row, and posts an Approve/Edit/Discard card to the staff channel. For DOCX attachments (Textract doesn’t accept them), falls back to python-docx. Memory: 512 MB. Timeout: 60 s.
- offer-engine: EventBridge rule on wl.slot_freed (and on the internal wl.roll_on re-entry). Reads s3://wl-waitlist-source/waitlist.csv and the rules and voice docs. Filters by eligibility (service, party size, date window, staff preference), sorts by the rules-doc order (join time, lifted by priority), reads wl-offers for live/tried state, decides on a move. Emits one event per slot that needs action: wl.make_offer, wl.roll_on, or wl.hand_back. No Bedrock calls. Memory: 512 MB. Timeout: 60 s.
- sender: EventBridge rule on the offer events. Resolves channel (mobile → SNS, else email → SES), checks quiet hours, writes the live offer to wl-offers with window_ends_at, mints a claim token (HMAC over slot_id|offer_seq|exp with a secret in wl/claim/signing-key), arms a one-off EventBridge Scheduler roll-on timer, formats the message from the voice template, and ships via SNS Publish or SES SendRawEmail. On quiet-hours defer, creates a Scheduler one-off that re-invokes sender at the next business minute (window armed only on actual send). Memory: 256 MB. Timeout: 30 s.
- claim-handler: Lambda Function URL, public with AuthType: NONE; verifies the HMAC claim token and its expiry. On claim, runs one DynamoDB UpdateItem on wl-slots with a condition expression status = :open AND live_offer_seq = :seq, setting status = booked, booked_by = entry_id; on success, confirms to the customer, deletes the roll-on Scheduler one-off, and updates the Drive sheet via the Sheets API. On ConditionalCheckFailedException, returns the “just filled” page. On decline, marks the offer declined in wl-offers and puts a wl.roll_on event. Writes wl-audit on every path. Memory: 256 MB. Timeout: 15 s.
- roll-on-timer: target of each one-off claim-window Scheduler rule. Re-reads wl-slots; if the slot is still open, marks the live offer timed_out in wl-offers and puts a wl.roll_on event; if booked, no-op. Idempotent on (slot_id, offer_seq). Memory: 256 MB. Timeout: 15 s.
- housekeeping: EventBridge Scheduler target, daily. Expires stale entries past their latest date, removes booked customers from the active list, and reconciles any slot left in an in-between state (e.g. a Scheduler rule that failed to fire). No Bedrock. Memory: 256 MB.
- summary: EventBridge Scheduler target, monthly on the first Monday at 9am. Reads the past month’s wl-offers, wl-slots, and wl-audit; calls Bedrock Haiku 4.5 to write a one-paragraph owner narrative (slots freed, slots filled, average time-to-fill, estimated recovered revenue); emails it via SES. Memory: 512 MB.
Storage
- DynamoDB · wl-slots: one row per freed slot. PK slot_id; attributes: service, slot_datetime, party_capacity, staff, status (open/booked/handed_back), live_offer_seq, booked_by. The conditional-write target. On-demand.
- DynamoDB · wl-offers: one row per offer attempt. PK (slot_id, offer_seq); attributes: entry_id, channel, sent_at, window_ends_at (epoch), status (live/claimed/declined/timed_out). GSI on entry_id for per-customer history. On-demand.
- DynamoDB · wl-audit: one row per action of any kind. PK (slot_id, ts); attributes: entry_id, outcome (offered/claimed/declined/timed_out/handed_back), by, notes. No TTL; this is the long-term audit trail. On-demand.
- DynamoDB · wl-ratelimit: per-IP token bucket for the web form. PK ip; short TTL on each item. On-demand.
- S3 · wl-waitlist-source: mirrored CSV from the Drive waitlist sheet. Versioning enabled. Lifecycle to Glacier at 90 days; expiry at 3 years.
- S3 · wl-rules-source: mirrored rules and voice docs as plain text. Versioning enabled.
- S3 · wl-raw-mime: raw inbound MIME from forwarded requests. Lifecycle to Glacier at 30 days; expiry at 1 year.
The no-double-booking guarantee
The whole safety property rests on a single DynamoDB conditional write. A slot lives in wl-slots with status = open and a live_offer_seq set by the sender. claim-handler issues UpdateItem with ConditionExpression: status = :open AND live_offer_seq = :seq. (status is a DynamoDB reserved word, so the real expression has to reference it through an ExpressionAttributeNames alias such as #status.) DynamoDB evaluates that condition atomically against the current item, so for any number of concurrent claims only one can satisfy it; the rest get ConditionalCheckFailedException. The token binds a link to one offer_seq, so a stale link from a rolled-on offer fails the live_offer_seq check whatever the slot’s status. The roll-on timer re-reads status and is a no-op on a booked slot. No locks, no transactions across tables, no read-then-write race: one write decides it.
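A minimal sketch of that write, using the low-level boto3 DynamoDB client's update_item call. The helper names are hypothetical, and storing live_offer_seq as a Number is an assumption. Because status is a DynamoDB reserved word, the sketch aliases it as #st through ExpressionAttributeNames.

```python
def build_claim_update(slot_id: str, offer_seq: int, entry_id: str) -> dict:
    """Arguments for the DynamoDB client's update_item call."""
    return {
        "TableName": "wl-slots",
        "Key": {"slot_id": {"S": slot_id}},
        # Book the slot only if it is still open AND this offer is the live one.
        "UpdateExpression": "SET #st = :booked, booked_by = :entry",
        "ConditionExpression": "#st = :open AND live_offer_seq = :seq",
        "ExpressionAttributeNames": {"#st": "status"},  # alias for reserved word
        "ExpressionAttributeValues": {
            ":open": {"S": "open"},
            ":booked": {"S": "booked"},
            ":seq": {"N": str(offer_seq)},  # assumption: stored as a Number
            ":entry": {"S": entry_id},
        },
    }


def try_claim(client, slot_id: str, offer_seq: int, entry_id: str) -> bool:
    """True if this caller won the slot, False if the condition failed."""
    try:
        client.update_item(**build_claim_update(slot_id, offer_seq, entry_id))
    except client.exceptions.ConditionalCheckFailedException:
        return False  # someone else claimed it, or the link is stale
    return True
```

With a real client this would be called as try_claim(boto3.client("dynamodb"), slot_id, offer_seq, entry_id). A False return maps to the “just filled” page.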
Bedrock
- Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. Two callsites: intake-ses-parser for inbound request parsing, and summary for the monthly owner narrative. Sonnet 4.6 (anthropic.claude-sonnet-4-6-...) is available as a swap on the summary if richer analysis is ever wanted, but Haiku is plenty for a single paragraph.
- Embeddings. Not used. The waitlist is structured rows; deterministic filter-and-sort beats vector retrieval here. No Knowledge Base, no S3 Vectors.
- Quotas. Default account quotas are more than enough at SMB volume. The offer path doesn’t call Bedrock; the parsing lane fires a few times a month at most.
EventBridge Scheduler config
- wl-drive-sync: rate(3 minutes). Target: drive-sync Lambda.
- wl-housekeeping: cron(0 3 * * ? *) in TZ. Target: housekeeping Lambda.
- wl-monthly-summary: cron(0 9 ? * 2#1 *) (first Monday at 9am) in TZ. Target: summary Lambda.
- Claim-window one-offs: created by sender per live offer. Use at(YYYY-MM-DDTHH:MM:SS) at window_ends_at with --action-after-completion DELETE so the rule self-cleans; target roll-on-timer. Deleted early by claim-handler on a successful claim.
- Quiet-hours defer one-offs: created by sender when an offer is held; at(...) at the next business minute, target sender.
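As a sketch of how sender might arm a claim-window one-off, the helper below builds the arguments for the EventBridge Scheduler CreateSchedule API. The schedule naming scheme (wl-window-<slot_id>-<offer_seq>), the target payload shape, and the choice to express the at() time in UTC are assumptions for illustration. Schedule names accept only letters, digits, hyphens, underscores, and dots, so slot_id is assumed to fit that.

```python
import json
from datetime import datetime, timezone


def claim_window_schedule(slot_id: str, offer_seq: int, window_ends_at: int,
                          target_arn: str, role_arn: str) -> dict:
    """Arguments for scheduler.create_schedule for one claim window."""
    fire_at = datetime.fromtimestamp(window_ends_at, tz=timezone.utc)
    return {
        "Name": f"wl-window-{slot_id}-{offer_seq}",  # hypothetical naming scheme
        # at() takes a local timestamp with no offset; the timezone is separate.
        "ScheduleExpression": f"at({fire_at.strftime('%Y-%m-%dT%H:%M:%S')})",
        "ScheduleExpressionTimezone": "UTC",
        "FlexibleTimeWindow": {"Mode": "OFF"},  # fire at the exact minute
        "ActionAfterCompletion": "DELETE",  # schedule removes itself after firing
        "Target": {
            "Arn": target_arn,    # roll-on-timer Lambda ARN
            "RoleArn": role_arn,  # role Scheduler assumes to invoke the target
            "Input": json.dumps({"slot_id": slot_id, "offer_seq": offer_seq}),
        },
    }
```

The result would be passed as boto3.client("scheduler").create_schedule(**args). A deterministic name lets claim-handler call delete_schedule on a successful claim without storing a separate handle.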
SES, SNS, inbound and outbound
- Set the MX record on a dedicated subdomain (e.g. waitlist.your-business.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
- SES inbound rule set wl-inbound-rules: one rule with recipient waitlist@your-business.com → spam scan → S3 PUT to s3://wl-raw-mime/<message-id> → stop. The S3 PUT triggers intake-ses-parser.
- SES outbound for email-fallback offers and the monthly summary: verify a sender identity at waitlist@your-business.com with DKIM and SPF on the parent domain. Out of sandbox by request.
- SNS for the offer texts: an origination identity (sender ID or 10DLC long code) registered for the destination country; per-message Publish with an SMS attribute set to Transactional for delivery priority.
IAM (least privilege per Lambda)
Each Lambda has its own role with policies scoped to exact ARNs. Sketch:
- offer-engine role: s3:GetObject on the waitlist, rules, and voice keys; dynamodb:Query + GetItem on wl-offers, wl-slots; events:PutEvents on the default bus. No bedrock:*.
- sender role: scheduler:CreateSchedule + DeleteSchedule for the window and defer one-offs; secretsmanager:GetSecretValue on the claim signing key; sns:Publish; ses:SendRawEmail from the verified identity; dynamodb:PutItem on wl-offers.
- claim-handler role: dynamodb:UpdateItem on wl-slots (the conditional write); dynamodb:PutItem on wl-offers and wl-audit; scheduler:DeleteSchedule for the window one-off; events:PutEvents for roll-on; secretsmanager:GetSecretValue on the claim signing key and the Sheets service-account secret; outbound network to sheets.googleapis.com.
- intake-ses-parser role: s3:GetObject on wl-raw-mime; textract:StartDocumentTextDetection; bedrock:InvokeModel on the Haiku ARN; outbound network for the staff-channel post.
- drive-sync and web-form roles: secretsmanager:GetSecretValue on the Google service-account secret; s3:PutObject on the waitlist and rules buckets (drive-sync); dynamodb:UpdateItem on wl-ratelimit (web-form); outbound network to www.googleapis.com.
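To make "scoped to exact ARNs" concrete, here is one way the claim-handler policy document could be generated. The account ID is a placeholder, and the schedule group (default), the wl-window-* name prefix, and the Sid labels are assumptions. The trailing -* on the secret ARNs covers the random suffix Secrets Manager appends to secret names.

```python
import json


def claim_handler_policy(account_id: str, region: str = "ap-southeast-1") -> dict:
    def arn(service: str, resource: str) -> str:
        return f"arn:aws:{service}:{region}:{account_id}:{resource}"

    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "ConditionalClaim", "Effect": "Allow",
             "Action": "dynamodb:UpdateItem",
             "Resource": arn("dynamodb", "table/wl-slots")},
            {"Sid": "OfferAndAuditWrites", "Effect": "Allow",
             "Action": "dynamodb:PutItem",
             "Resource": [arn("dynamodb", "table/wl-offers"),
                          arn("dynamodb", "table/wl-audit")]},
            {"Sid": "CancelWindowTimer", "Effect": "Allow",
             "Action": "scheduler:DeleteSchedule",
             # Assumes one-offs live in the default group with this name prefix.
             "Resource": arn("scheduler", "schedule/default/wl-window-*")},
            {"Sid": "RollOnEvent", "Effect": "Allow",
             "Action": "events:PutEvents",
             "Resource": arn("events", "event-bus/default")},
            {"Sid": "ReadSecrets", "Effect": "Allow",
             "Action": "secretsmanager:GetSecretValue",
             "Resource": [arn("secretsmanager", "secret:wl/claim/signing-key-*"),
                          arn("secretsmanager", "secret:wl/drive/sa-*")]},
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID; prints JSON ready to paste into a role definition.
    print(json.dumps(claim_handler_policy("123456789012"), indent=2))
```

No statement uses a wildcard resource, so a compromised claim-handler cannot read the waitlist CSV, send messages, or call Bedrock.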
Freed-slot sources
Three producers put a wl.slot_freed event on the bus. A booking-tool webhook hits a thin Function URL (slot-webhook, signature-verified) on a cancellation. A no-show tap from the front-desk view hits the same Function URL with a no-show action. A calendar-opening sync (optional, if availability lives in Google Calendar) runs on the housekeeping tick and emits a freed-slot event for any newly added gap. All three normalize to the same event shape — service, slot_datetime, party_capacity, staff — so the engine has one input contract.
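A sketch of that shared input contract: each producer passes its own payload through one normalizer before calling PutEvents. The Source naming (wl.<producer>) and the use of wl.slot_freed as the DetailType are assumptions; the four detail fields are the ones listed above.

```python
import json

REQUIRED_FIELDS = ("service", "slot_datetime", "party_capacity", "staff")


def slot_freed_entry(detail: dict, producer: str) -> dict:
    """Build one PutEvents entry in the common wl.slot_freed shape."""
    missing = [name for name in REQUIRED_FIELDS if name not in detail]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {
        "Source": f"wl.{producer}",    # hypothetical: webhook, no-show, calendar
        "DetailType": "wl.slot_freed",
        # Drop producer-specific extras so the engine sees one contract.
        "Detail": json.dumps({name: detail[name] for name in REQUIRED_FIELDS}),
        "EventBusName": "default",
    }
```

A producer would then call boto3.client("events").put_events(Entries=[slot_freed_entry(payload, "webhook")]).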
Observability and cost gates
- CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Metric filter on "error" + "throttle" + "timeout" to a metric for alerting.
- Alarms: claim-handler 5xx > 0 (a customer can’t claim); roll-on-timer failures > 0 (a slot could stall); SNS SMS delivery-failure rate > 5% (carrier or origination-id issue); offer-engine errors > 0.
- SQS + DLQ: the EventBridge targets use an SQS dead-letter queue so a failed roll-on or send can be replayed instead of lost.
- X-Ray: off by default. Not worth the cost at SMB volume.
- AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic wl-cost-alarm subscribed to the on-call admin’s email.
Config and secrets
Service-account credentials for Drive and Sheets live in Secrets Manager under wl/drive/sa. The claim-token signing key is wl/claim/signing-key; the booking-webhook signing secret is wl/webhook/secret. The configured timezone, quiet-hours window, claim-window length, per-slot try cap, and down-fit policy all live in Parameter Store under /wl/config/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment. The rules and voice docs are read fresh from S3 per invocation so staff edits take effect on the next freed slot without a deploy.
Deploy
GitHub Actions with OIDC into a deploy role (no long-lived keys) and AWS SAM. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for both wl-waitlist-source and wl-rules-source so a bad Drive edit rolls back in one click, and pin the EventBridge Scheduler timezone so a CI rotation can’t silently start the housekeeping tick in UTC. Total deployable surface: around nine Lambdas, four DynamoDB tables, three S3 buckets, one EventBridge rule set on the default bus (plus the Scheduler rules), one SES rule set, one SNS origination identity, one SQS DLQ, and one Budgets alarm.
That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.