Part 7 of 7 · Shift scheduler series ~8 min read

Engineering reference: the shift scheduler architecture

Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, EventBridge Scheduler config, the DynamoDB schemas, and the Slack interactive flow. Read alongside the previous six posts; this one’s the build sheet.

Region and account shape

Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock cross-Region inference, and EventBridge Scheduler are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup work at small-team volume — the failure mode for a small team is a manager publishing a rota a few hours late, not a regional outage. One AWS account dedicated to the scheduler (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system.

Topology

AWS topology of the shift scheduler A topology diagram with three regions stacked vertically inside one AWS account boundary. Top region: ingress. Three boxes show the three intake lanes — a Drive sheet sync via the drive-sync Lambda triggered every 15 minutes by EventBridge Scheduler that mirrors the roster CSV to s3://ss-roster-source/, an SES inbound rule set with action S3 PUT to s3://ss-raw-mime/ plus the parser Lambda intake-timeoff that calls Bedrock Haiku 4.5 to read a plain-English time-off note into a date range for Slack approval, and a template-sync Lambda triggered weekly by EventBridge Scheduler that copies the standing weekly pattern into next week's tab and proposes any conflicts the same way. Middle region: scheduled processing. The drafter Lambda is triggered weekly on Thursday at 2pm local by EventBridge Scheduler; it reads s3://ss-roster-source/roster.csv, sorts the shifts, reads the rules from s3://ss-rules-source/rules.txt, tracks running hours in DynamoDB, and emits one of three events to the EventBridge default bus per shift that needs an action: ss.short_staffed, ss.held, or ss.draft_ready, plus the assembled draft for the manager. Bottom region: publish and approval. The publish Lambda is triggered by an EventBridge rule once the manager approves; it splits the rota per person, checks quiet hours, fetches the message template from s3://ss-rules-source/voice.txt, posts each person's own shifts to Slack via incoming webhook with a Request-swap button or sends an email via SES outbound, attaches calendar invites, and writes rows to DynamoDB ss-shifts. Slack interactive button clicks — Approve, Edit, Re-draft, and the swap actions Cover, Drop, Time-off — land on a Function URL Lambda action-handler that updates ss-shifts, ss-hours, and ss-audit and, on a swap, proposes a replacement back to the manager. CloudWatch Logs collects from every Lambda at 7-day retention. Across the right edge: a small box labelled AWS Budgets alarm at $15 monthly threshold, posting to SNS topic ss-cost-alarm. A note at the bottom: the scheduler only proposes — and every interaction is logged to ss-audit. Ingress Lambda · drive-sync every 15 min Sheets API → s3://ss-roster-source/ roster.csv SES inbound rule set ss-inbound-rules action: S3 PUT s3://ss-raw-mime/ trigger: intake-timeoff Lambda · template-sync weekly copy standing pattern into next week's tab → Slack proposal Drive roster sheet canonical store · mirrored to S3 Scheduled processing EventBridge Scheduler cron(0 14 ? * 5 *) in TZ_NAME target: drafter Lambda + deferred one-offs Lambda · drafter reads CSV from S3 + rules.txt + voice.txt matches by rule, picks one of four EventBridge default bus ss.draft_ready ss.short_staffed ss.held (filled → in draft) Publish & approval Lambda · publish splits per person, quiet hours, invites; Slack webhook or SES outbound Slack interactive manager [Approve] swap [Cover][Drop] button clicks → Function URL Lambda · action-handler writes ss-shifts, ss-hours, ss-audit; on swap proposes a cover to the manager The scheduler only proposes — and every interaction is logged to ss-audit.
Fig 7. AWS topology, in three regions of the diagram: ingress (three lanes into the roster), scheduled processing (the weekly drafter emitting events), publish and approval (the schedule ships after the manager approves and the team’s responses are recorded). Every Lambda is event- or schedule-driven; nothing is synchronous-chained.

Lambda functions

All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.

  • drive-sync — EventBridge Scheduler target, fires every 15 minutes. Uses the Google Drive API + Sheets API (service-account credentials in Secrets Manager under ss/drive/sa) to export the roster sheet as CSV and write to s3://ss-roster-source/roster.csv only if the sheet has changed since the last sync. Same pattern syncs the rules and voice docs to s3://ss-rules-source/. Memory: 256 MB. Timeout: 30 s.
  • template-sync — EventBridge Scheduler target, weekly (a few hours before the draft). Copies the standing weekly pattern tab into next week’s tab via the Sheets API, then checks the result against approved time-off; any clash becomes a Slack interactive proposal for the manager. Memory: 256 MB. Timeout: 30 s.
  • intake-timeoff — S3 PUT trigger on s3://ss-raw-mime/. Parses MIME, extracts the note text, and calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0 via global.anthropic.claude-haiku-4-5-20251001-v1:0) to read the plain-English request into {start_date, end_date, reason}, resolving relative dates against the configured timezone. Posts the proposal to the manager’s Slack with Approve/Edit/Decline buttons. No Textract — the notes are plain text, not documents. Memory: 512 MB. Timeout: 30 s.
  • drafter — EventBridge Scheduler target, weekly on Thursday at 2pm local time (the schedule expression runs in TZ_NAME set to the team’s timezone, e.g. Asia/Singapore). Reads s3://ss-roster-source/roster.csv and the rules and voice docs. Sorts shifts, lists qualified-and-available candidates per shift, ranks by hours-below-target, places each, and tracks running hours in ss-hours. Emits the assembled draft plus one event per flagged shift: ss.short_staffed or ss.held; the whole draft goes to the manager as ss.draft_ready. Memory: 512 MB. Timeout: 60 s. No Bedrock calls.
  • publish — EventBridge rule on the manager’s approval event. Splits the rota per person, checks quiet hours, formats each person’s own shifts from the voice template, attaches calendar invites, and ships via Slack incoming webhook (ss/slack/webhook in Secrets Manager) or SES SendRawEmail. On a quiet-hours defer, creates a one-off EventBridge Scheduler rule that re-invokes publish at the next reasonable minute. Writes rows to ss-shifts after a successful send. Memory: 256 MB. Timeout: 30 s.
  • action-handler — Lambda Function URL, public with AuthType: NONE; verifies a Slack signature on the request body. Triggered by Slack interactive clicks (Approve/Edit/Re-draft and the swap actions Cover/Drop/Time-off) and by email-link clicks. On approve, fires the publish event. On a swap, reuses the drafter’s candidate logic to propose a replacement back to the manager, then on the manager’s yes updates ss-shifts and ss-hours. Writes to ss-audit on every action. Memory: 256 MB. Timeout: 15 s.
  • digest — EventBridge Scheduler target, weekly Sunday 6pm. Reads ss-shifts for the coming week; sends each person a short reminder of their upcoming shifts and the manager a heads-up on any still-open shifts. No Bedrock; the message is a plain summary table. Memory: 256 MB.
  • summary — EventBridge Scheduler target, weekly Friday 5pm. Reads the week’s ss-hours and ss-audit; calls Bedrock Haiku 4.5 to write a short fairness narrative (each person’s hours against target, any drift); posts it to the manager’s Slack. Memory: 512 MB.

Storage

  • DynamoDB · ss-shifts — one row per published shift. PK (week_id, shift_id); attributes: day, start, end, role, assigned_to, status (filled/open/held), dispatched_via (slack/email). On-demand. No TTL.
  • DynamoDB · ss-hours — running placed-hours per person per week. PK person_id; sort key week_id; attributes: hours_placed, hours_target, cap. On-demand.
  • DynamoDB · ss-audit — one row per write action of any kind. PK (shift_id, ts); attributes: action (drafted/approved/swapped/dropped/timeoff), by_user, before, after. On-demand. No TTL — this is the long-term audit trail.
  • DynamoDB · ss-shifts-archive — archived weeks after they pass. Same shape as ss-shifts; PK (week_id, shift_id). On-demand.
  • S3 · ss-roster-source — mirrored CSV from the Drive roster sheet. Versioning enabled. Lifecycle to Glacier at 90 days; expiry at 3 years.
  • S3 · ss-rules-source — mirrored rules and voice docs as plain text. Versioning enabled.
  • S3 · ss-raw-mime — raw inbound MIME from forwarded time-off notes. Lifecycle to Glacier at 30 days; expiry at 3 years.
  • S3 · ss-published — a snapshot of each approved weekly rota as published, kept for reference and for re-sending a person their week on request.

Bedrock

  • Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. Two callsites: intake-timeoff for reading plain-English time-off notes, and summary for the weekly fairness narrative. The heavier anthropic.claude-sonnet-4-6 profile is wired but unused by default — the tasks here are light enough for Haiku, and a model isn’t on the hot path at all.
  • Embeddings. Not used. The roster is structured rows; rule-based matching beats vector retrieval here. No Knowledge Base, no S3 Vectors.
  • Quotas. Default account quotas are more than enough at small-team volume. The drafter itself doesn’t call Bedrock; the time-off lane fires a few times a week at most.

EventBridge Scheduler config

  • ss-weekly-draftcron(0 14 ? * 5 *) (Thursday 2pm) in the team’s timezone. Target: drafter Lambda.
  • ss-drive-syncrate(15 minutes). Target: drive-sync Lambda.
  • ss-template-synccron(0 10 ? * 5 *) (Thursday 10am, before the draft) in TZ. Target: template-sync Lambda.
  • ss-weekly-digestcron(0 18 ? * SUN *) in TZ. Target: digest Lambda.
  • ss-weekly-summarycron(0 17 ? * 6 *) (Friday 5pm) in TZ. Target: summary Lambda.
  • One-off rules — created on the fly by publish when a quiet-hours defer is needed. Use at(YYYY-MM-DDTHH:MM:SS) expressions with --action-after-completion DELETE so the rule self-cleans.

SES inbound and outbound

  • Set the MX record on a dedicated subdomain (e.g. timeoff.your-company.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
  • SES inbound rule set ss-inbound-rules: one rule with recipient timeoff@your-company.com → spam scan → S3 PUT to s3://ss-raw-mime/<message-id> → stop. The S3 PUT triggers intake-timeoff.
  • SES outbound for the email-fallback schedules: verify a sender identity at rota@your-company.com with DKIM and SPF on the parent domain. Out of sandbox by request.

IAM (least privilege per Lambda)

Each Lambda has its own role with policies scoped to exact ARNs. Sketch:

  • drafter role: s3:GetObject on the roster, rules, and voice keys; dynamodb:Query + GetItem + PutItem on ss-hours; events:PutEvents on the default bus. No bedrock:*.
  • publish role: events:ListSchedules + CreateSchedule for the deferred-publish one-offs; secretsmanager:GetSecretValue on the Slack webhook secret; ses:SendRawEmail from the verified sender identity; dynamodb:PutItem on ss-shifts; outbound network access to hooks.slack.com.
  • action-handler role: dynamodb:PutItem on ss-shifts, ss-hours, and ss-audit; secretsmanager:GetSecretValue on the Sheets-API service-account secret; outbound network access to sheets.googleapis.com; dynamodb:Query for candidate lookup; dynamodb:BatchWriteItem for archiving a passed week to ss-shifts-archive.
  • intake-timeoff role: s3:GetObject on ss-raw-mime; bedrock:InvokeModel on the Haiku ARN; secretsmanager:GetSecretValue on the Slack webhook.
  • drive-sync and template-sync roles: secretsmanager:GetSecretValue on the Google service-account secret; s3:PutObject on the roster and rules buckets; outbound network to www.googleapis.com.

Slack interactive flow

The Slack incoming webhook is the simplest delivery surface but doesn’t support interactive button responses. So the manager-facing messages and the per-person schedules are posted via the chat.postMessage Web API instead, with Block Kit blocks containing the action buttons. Button clicks are sent by Slack to the configured Interactivity request URL, which is the action-handler Function URL. action-handler verifies the Slack signing secret on the inbound request, parses the action_id (approve, edit, redraft, cover, drop, timeoff), opens a modal if needed (Edit and Cover open modals; Approve is one-tap), and processes the response when the modal is submitted.

The Slack app needs chat:write, im:write, and the Interactivity URL configured. The bot token lives in Secrets Manager under ss/slack/bot-token. The signing secret is ss/slack/signing-secret.

Observability and cost gates

  • CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on "error" + "throttle" + "timeout" to a CloudWatch metric for alerting.
  • Alarms: drafter Lambda failures > 0 in a week (the weekly draft is the one piece that has to run); publish failure rate > 1% in 24h; action-handler signature-verification failures > 5/hour (might mean the Slack secret rotated).
  • X-Ray: off by default. Not worth the cost at small-team volume.
  • AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic ss-cost-alarm subscribed to the on-call admin’s email and Slack.

Config and secrets

Service-account credentials for Drive, Sheets, and Calendar APIs all live in Secrets Manager under ss/drive/sa (one service account with scopes for all three APIs). Slack bot token, signing secret, and webhook URL all under ss/slack/*. SES sender identity lives in IAM and the verified-domain config. The configured timezone, quiet-hours window, default max-hours cap, rest gap, and admin fallback all live in Parameter Store under /ss/config/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.

Deploy

GitHub Actions with OIDC into a deploy role (no long-lived keys) and AWS SAM. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for both ss-roster-source and ss-rules-source so a bad Drive edit can be rolled back in one click, and version the EventBridge Scheduler timezone setting so you don’t accidentally start running the weekly draft in UTC after a CI rotation. SAM fits this surface well; CDK with a Python stack file also works. Total deployable surface: around eight Lambdas, four DDB tables, four S3 buckets, one EventBridge rule on the default bus (plus the Scheduler rules), one SES rule set, and one Budgets alarm.

That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.

All posts