Part 7 of 7 · Onboarding guide series ~8 min read

Engineering reference: the onboarding guide architecture

Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, EventBridge Scheduler config, the DynamoDB schemas, and the Slack interactive flow. Read alongside the previous six posts; this one’s the build sheet.

Region and account shape

Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock Global cross-Region inference, and EventBridge Scheduler are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup work at SMB volume — the failure mode for an SMB is a new customer missing a welcome email, not a regional outage. One AWS account dedicated to the guide (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system.

Topology

AWS topology of the onboarding guide A topology diagram with three regions stacked vertically inside one AWS account boundary. Top region: ingress. Three boxes show the three enrollment lanes — a signup webhook on a Lambda Function URL named enroll that validates the payload and writes the row to the Drive sheet, a Drive sheet sync via the drive-sync Lambda triggered every 15 minutes by EventBridge Scheduler that mirrors the onboarding CSV to s3://og-list-source/, and an SES inbound rule set with action S3 PUT to s3://og-raw-mime/ plus the parser Lambda intake-ses-parser that calls Bedrock Haiku 4.5 to propose a row for Slack approval. Middle region: scheduled processing. The guide Lambda is triggered daily at 9am local by EventBridge Scheduler; it reads s3://og-list-source/onboarding.csv, iterates rows, computes days_since_signup per customer, looks up the step plan in s3://og-rules-source/rules.txt, reads send and step state from DynamoDB, and emits one of three events to the EventBridge default bus per customer that needs an action: og.step_due, og.nudge, or og.flag. Bottom region: send and acknowledgment. The sender Lambda is triggered by an EventBridge rule on those three event types; it resolves the channel, checks quiet hours and the weekend and holiday calendar, fetches the message template from s3://og-rules-source/voice.txt, sends the email via SES outbound with a one-click done link or posts a flag to Slack via incoming webhook, and writes a row to DynamoDB og-sends. The done link and the Slack flag buttons land on a Function URL Lambda done-handler that updates og-state with the step done or the action (finish, pause, hand off) and, on finish, updates the onboarding sheet via the Google Sheets API. CloudWatch Logs collects from every Lambda at 7-day retention. Across the right edge: a small box labelled AWS Budgets alarm at $15 monthly threshold, posting to SNS topic og-cost-alarm. A note at the bottom: every message leaves with full context — and every interaction is logged to og-audit. Ingress Lambda · enroll Function URL webhook validates payload → Sheets API row, starts at day 0 Lambda · drive-sync every 15 min Sheets API → s3://og-list-source/ onboarding.csv SES inbound rule set og-inbound-rules action: S3 PUT s3://og-raw-mime/ trigger: intake-ses-parser Drive onboarding sheet canonical store · mirrored to S3 Scheduled processing EventBridge Scheduler cron(0 9 * * ? *) in TZ_NAME target: guide Lambda + deferred one-offs Lambda · guide reads CSV from S3 + rules.txt + voice.txt computes days, picks one of four moves EventBridge default bus og.step_due og.nudge og.flag (on track → no event) Send & acknowledgment Lambda · sender resolves channel, quiet hours, weekends; SES outbound or Slack webhook Customer + Slack email done link; flag [Pause][Hand off] clicks → Function URL Lambda · done-handler writes og-state, og-audit, and on finish updates the Sheet via Sheets API Every message leaves with full context — and every interaction is logged to og-audit.
Fig 7. AWS topology, in three regions of the diagram: ingress (three lanes into the list), scheduled processing (the daily guide tick emitting events), send and acknowledgment (the message ships and the customer’s or owner’s response is recorded). Every Lambda is event- or schedule-driven; nothing is synchronous-chained.

Lambda functions

All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.

  • enroll — Lambda Function URL, AuthType: NONE; verifies a shared-secret header against og/webhook/secret in Secrets Manager. Triggered by your app on every signup. Validates the JSON body (email format, known plan, signup timestamp not in the future), normalizes it, and writes the row to the Drive sheet via the Sheets API. Idempotent on a client-supplied signup_id so a retried POST doesn’t enroll twice. Memory: 256 MB. Timeout: 15 s.
  • drive-sync — EventBridge Scheduler target, fires every 15 minutes. Uses the Google Drive API + Sheets API (service-account credentials in Secrets Manager under og/drive/sa) to export the onboarding sheet as CSV and write to s3://og-list-source/onboarding.csv only if the sheet has changed since the last sync. Same pattern syncs the rules and voice docs to s3://og-rules-source/. Memory: 256 MB. Timeout: 30 s.
  • intake-ses-parser — S3 PUT trigger on s3://og-raw-mime/. Parses MIME, extracts the text body of the forwarded welcome email, and calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0 via global.anthropic.claude-haiku-4-5-20251001-v1:0) to propose a customer row (name, email, plan, signup date). Posts the proposal to Slack via the incoming webhook with Approve/Edit/Discard buttons. No Textract is needed — welcome emails are plain text, not scanned PDFs. Memory: 256 MB. Timeout: 30 s.
  • guide — EventBridge Scheduler target, daily at 9am local time (the schedule expression runs in TZ_NAME set to the SMB’s timezone, e.g. Asia/Singapore). Reads s3://og-list-source/onboarding.csv and the rules and voice docs. For each row, computes days_since_signup, reads step state from og-sends and og-state, decides on a move. Emits one event per row that needs action: og.step_due, og.nudge, or og.flag, with the customer context as the event payload. On-track customers emit nothing. Memory: 512 MB. Timeout: 60 s. No Bedrock calls.
  • sender — EventBridge rule on the three move events. Resolves channel, checks quiet hours and the weekend/holiday calendar, formats the message from the voice template, and ships via SES SendRawEmail (customer step/nudge) or the Slack incoming webhook (og/slack/webhook in Secrets Manager) for an owner flag. On quiet-hours or weekend defer, creates a one-off EventBridge Scheduler rule that re-invokes sender at the next available business minute. For steps the rules doc marks as personalized, calls Bedrock Haiku 4.5 to lightly rewrite the template body before sending. Writes a row to og-sends after a successful send. Memory: 256 MB. Timeout: 30 s.
  • done-handler — Lambda Function URL, public with AuthType: NONE; verifies a signed token on the done link and a Slack signature on the flag-button requests. Triggered by customer done-link clicks and by Slack interactive button clicks (Pause/Hand off/Done). Writes to og-state and og-audit; on finish, updates the Drive sheet via the Sheets API and archives the old journey in og-sends-archive. Memory: 256 MB. Timeout: 15 s.
  • digest — EventBridge Scheduler target, weekly Sunday 6pm. Reads og-sends and og-state for the past week and the list; sends a digest message to a configured Slack channel summarizing who finished, who’s stuck, and who’s paused. No Bedrock; the message is a plain summary table. Memory: 256 MB.
  • summary — EventBridge Scheduler target, monthly on the first Monday at 9am. Reads the past month’s og-sends, og-state, and og-audit; calls Bedrock Haiku 4.5 to write a one-paragraph activation narrative (sign-ups, finishes, where people drop off); emails it via SES to the configured stakeholder list. Memory: 512 MB.

Storage

  • DynamoDB · og-sends — one row per message sent. PK (customer_id, step_id); attributes: sent_date, kind (step/nudge), channel (email/slack), recipient. On-demand. No TTL.
  • DynamoDB · og-state — one row per customer’s live progress. PK customer_id; attributes: steps_done (set), paused, finished, flagged, handed_off_to, paused_note. On-demand.
  • DynamoDB · og-audit — one row per write action of any kind. PK (customer_id, ts); attributes: action, by_user, before, after. On-demand. No TTL — this is the long-term audit trail.
  • DynamoDB · og-sends-archive — archived journeys after a finish. Same shape as og-sends; PK (customer_id, journey_id, step_id). On-demand.
  • S3 · og-list-source — mirrored CSV from the Drive onboarding sheet. Versioning enabled. Lifecycle to Glacier at 90 days; expiry at 7 years.
  • S3 · og-rules-source — mirrored rules and voice docs as plain text. Versioning enabled.
  • S3 · og-raw-mime — raw inbound MIME from forwarded welcome emails. Lifecycle to Glacier at 30 days; expiry at 7 years.
  • S3 · og-static — the small HTML for the done-link landing page and any images referenced in the email templates.

Bedrock

  • Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. Three callsites: intake-ses-parser for the welcome-email parsing, sender for the occasional personalized rewrite, and summary for the monthly narrative. anthropic.claude-sonnet-4-6-20250930-v1:0 is wired but unused at this volume — reserved for a future richer rewrite path if one is justified.
  • Embeddings. Not used. The onboarding list is structured rows; deterministic lookup beats vector retrieval here. No Knowledge Base, no S3 Vectors.
  • Quotas. Default account quotas are more than enough at SMB volume. The guide itself doesn’t call Bedrock; the parsing and rewrite lanes fire a few times a day at most.

EventBridge Scheduler config

  • og-daily-tickcron(0 9 * * ? *) in the SMB’s timezone. Target: guide Lambda.
  • og-drive-syncrate(15 minutes). Target: drive-sync Lambda.
  • og-weekly-digestcron(0 18 ? * SUN *) in TZ. Target: digest Lambda.
  • og-monthly-summarycron(0 9 ? * 2#1 *) (first Monday at 9am) in TZ. Target: summary Lambda.
  • One-off rules — created on the fly by sender when a quiet-hours or weekend defer is needed. Use at(YYYY-MM-DDTHH:MM:SS) expressions with --action-after-completion DELETE so the rule self-cleans.

SES inbound and outbound

  • Set the MX record on a dedicated subdomain (e.g. welcome.your-company.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
  • SES inbound rule set og-inbound-rules: one rule with recipient welcome@your-company.com → spam scan → S3 PUT to s3://og-raw-mime/<message-id> → stop. The S3 PUT triggers intake-ses-parser.
  • SES outbound for the step, nudge, and wrap-up emails: verify a sender identity at hello@your-company.com with DKIM and SPF on the parent domain. Out of sandbox by request. Configure a configuration set with open/click tracking off and a bounce/complaint SNS topic so a hard-bounced customer email flips the row to paused automatically.

IAM (least privilege per Lambda)

Each Lambda has its own role with policies scoped to exact ARNs. Sketch:

  • guide role: s3:GetObject on the list, rules, and voice keys; dynamodb:Query + GetItem on og-sends, og-state; events:PutEvents on the default bus. No bedrock:*.
  • sender role: events:ListSchedules + CreateSchedule for the deferred-send one-offs; secretsmanager:GetSecretValue on the Slack webhook secret; ses:SendRawEmail from the verified sender identity; bedrock:InvokeModel on the Haiku ARN (for the rewrite path); dynamodb:PutItem on og-sends; outbound network access to hooks.slack.com.
  • done-handler role: dynamodb:PutItem on og-state and og-audit; secretsmanager:GetSecretValue on the Sheets-API service-account secret; outbound network access to sheets.googleapis.com; dynamodb:Query for state lookup; on finish, dynamodb:BatchWriteItem for archiving the journey to og-sends-archive.
  • enroll role: secretsmanager:GetSecretValue on the webhook secret and the Sheets service-account secret; outbound network to sheets.googleapis.com.
  • intake-ses-parser role: s3:GetObject on og-raw-mime; bedrock:InvokeModel on the Haiku ARN; secretsmanager:GetSecretValue on the Slack webhook.
  • drive-sync role: secretsmanager:GetSecretValue on the Google service-account secret; s3:PutObject on the list and rules buckets; outbound network to www.googleapis.com.

Slack interactive flow

The Slack incoming webhook is the simplest delivery surface but doesn’t support interactive button responses. So the owner flag messages are posted via the chat.postMessage Web API instead, with Block Kit blocks containing the action buttons. Button clicks are sent by Slack to the configured Interactivity request URL, which is the done-handler Function URL. done-handler verifies the Slack signing secret on the inbound request, parses the action_id (pause, hand_off, done), opens a modal if needed (Pause and Hand off open modals; Done is one-tap), and processes the response when the modal is submitted.

The Slack app needs chat:write, im:write, and the Interactivity URL configured. The bot token lives in Secrets Manager under og/slack/bot-token. The signing secret is og/slack/signing-secret.

Observability and cost gates

  • CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on "error" + "throttle" + "timeout" to a CloudWatch metric for alerting.
  • Alarms: guide Lambda failures > 0 in a day (the daily tick is the one piece that has to run); sender failure rate > 1% in 24h; done-handler signature-verification failures > 5/hour (might mean the Slack secret rotated); SES bounce rate above the SES reputation threshold.
  • X-Ray: off by default. Not worth the cost at SMB volume.
  • AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic og-cost-alarm subscribed to the on-call admin’s email and Slack.

Config and secrets

Service-account credentials for Drive and Sheets APIs live in Secrets Manager under og/drive/sa (one service account with scopes for both APIs). The signup-webhook shared secret is og/webhook/secret. Slack bot token, signing secret, and webhook URL all under og/slack/*. SES sender identity lives in IAM and the verified-domain config. The configured timezone, holiday list reference, quiet-hours window, weekend setting, and onboarding-owner Slack ID all live in Parameter Store under /og/config/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.

Deploy

GitHub Actions with OIDC into a deploy role (no long-lived keys) running AWS SAM. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for both og-list-source and og-rules-source so a bad Drive edit can be rolled back in one click, and version the EventBridge Scheduler timezone setting so you don’t accidentally start running the daily tick in UTC after a CI rotation. SAM fits cleanly here; CDK with a Python stack file also works. Total deployable surface: around eight Lambdas, four DDB tables, four S3 buckets, one EventBridge rule on the default bus (plus the Scheduler rules), one SES rule set, and one Budgets alarm.

That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.

All posts