Part 7 of 7 · Backup sentinel series ~8 min read

Engineering reference: the backup sentinel architecture

Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, EventBridge Scheduler config, the DynamoDB schemas, and the Slack interactive flow. Read alongside the previous six posts; this one’s the build sheet.

Region and account shape

Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock Global cross-Region inference, and EventBridge Scheduler are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup work at SMB volume — the failure mode for an SMB is missing a backup failure, not a regional outage. Worth noting the dependency loop: the sentinel watches your backups, so don’t host it in the same account or region as the systems it watches — run it in a dedicated account so a blast that takes out a workload doesn’t also blind the watcher. One AWS account dedicated to the sentinel keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system. Cross-account read access to the buckets it inspects is granted via narrowly-scoped resource policies, never by running the sentinel inside the watched account.

Topology

AWS topology of the backup sentinel A topology diagram with three regions stacked vertically inside one AWS account boundary. Top region: ingress. Three boxes show the three intake lanes — a Drive sheet sync via the drive-sync Lambda triggered every 15 minutes by EventBridge Scheduler that mirrors the job-list CSV to s3://bk-registry-source/, an SES inbound rule set with action S3 PUT to s3://bk-raw-mime/ plus the parser Lambda intake-ses-parser that reads forwarded backup reports and calls Bedrock Haiku 4.5 to propose a job row for Slack approval, and a heartbeat Function URL Lambda that records check-ins from backup scripts and proposes a row the first time a new key is seen. Middle region: scheduled processing. The checker Lambda is triggered on a schedule by EventBridge Scheduler; it reads s3://bk-registry-source/jobs.csv, iterates rows, gathers each job's latest evidence by listing the target S3 prefix or reading the latest heartbeat, runs the three tests, looks up thresholds in s3://bk-rules-source/rules.txt, reads last state from DynamoDB bk-state, and emits one of three events to the EventBridge default bus per job whose state changed: bk.warn, bk.alert, or bk.escalate. Bottom region: dispatch and acknowledgment. The dispatch Lambda is triggered by an EventBridge rule on those three event types; it resolves the owner, checks quiet hours and the holiday calendar, fetches the alert template from s3://bk-rules-source/voice.txt, posts the message to Slack via chat.postMessage with action buttons or sends an email via SES outbound, and updates DynamoDB bk-state. Slack interactive button clicks land on a Function URL Lambda ack-handler that updates bk-state and bk-audit with the action: mark-fixed, snooze, or mute. CloudWatch Logs collects from every Lambda at 7-day retention. Across the right edge: a small box labelled AWS Budgets alarm at $15 monthly threshold, posting to SNS topic bk-cost-alarm. A note at the bottom: it only reads — every interaction is logged to bk-audit. Ingress Lambda · drive-sync every 15 min Sheets API → s3://bk-registry-source/ jobs.csv SES inbound rule set bk-inbound-rules action: S3 PUT s3://bk-raw-mime/ trigger: intake-ses-parser Lambda · heartbeat Function URL records job check-ins new key → propose row → Slack proposal Drive job list canonical store · mirrored to S3 Scheduled processing EventBridge Scheduler rate(1 hour) or cron in TZ_NAME target: checker Lambda + deferred one-offs Lambda · checker reads CSV from S3 + rules.txt + voice.txt runs three tests, picks one of four states EventBridge default bus bk.warn bk.alert bk.escalate (all green → no event) Dispatch & acknowledgment Lambda · dispatch resolves owner, quiet hours, holidays; Slack postMessage or SES outbound Slack interactive DM with [Mark fixed] [Snooze] [Mute] button clicks → Function URL Lambda · ack-handler writes bk-state, bk-audit; mark-fixed sets pending-fixed, confirmed on next check It only reads — and every interaction is logged to bk-audit.
Fig 7. AWS topology, in three regions of the diagram: ingress (three lanes into the job list), scheduled processing (the checker emitting events on state change), dispatch and acknowledgment (the alert ships and the owner’s response is recorded). Every Lambda is event- or schedule-driven; nothing is synchronous-chained.

Lambda functions

All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.

  • drive-sync — EventBridge Scheduler target, fires every 15 minutes. Uses the Google Drive API + Sheets API (service-account credentials in Secrets Manager under bk/drive/sa) to export the job-list sheet as CSV and write to s3://bk-registry-source/jobs.csv only if the sheet has changed since the last sync. Same pattern syncs the rules and voice docs to s3://bk-rules-source/. Memory: 256 MB. Timeout: 30 s.
  • heartbeat — Lambda Function URL, AuthType: NONE; authenticates each check-in with a per-job HMAC key (stored under bk/heartbeat/keys) passed in the request, so a leaked URL alone can’t forge a heartbeat. Records each check-in to bk-heartbeats with (job_id, ts, reported_size). The first time an unknown job_id checks in, posts a Slack interactive proposal to register it. Memory: 256 MB. Timeout: 15 s.
  • intake-ses-parser — S3 PUT trigger on s3://bk-raw-mime/. Parses MIME, extracts the report body and any attached log text, and calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0 via global.anthropic.claude-haiku-4-5-20251001-v1:0) to propose a job row. Posts the proposal to Slack via chat.postMessage with Approve/Edit/Discard buttons. Reports are short text emails — no Textract or OCR is needed; if a backup tool only emits a structured JSON/XML status file, the parser reads it directly and skips the model call. Memory: 256 MB. Timeout: 30 s.
  • checker — EventBridge Scheduler target, on a schedule (default rate(1 hour); jobs with a weekly cadence are evaluated on a slower companion schedule to save cost). Reads s3://bk-registry-source/jobs.csv and the rules and voice docs. For each row: gathers evidence (S3 ListObjectsV2 on the job’s target prefix for newest key + size, or latest row from bk-heartbeats), runs the three tests, reads last state from bk-state, decides on a state. Emits one event per job whose state changed: bk.warn, bk.alert, or bk.escalate, with the job context as the event payload. All-green jobs emit nothing. Memory: 512 MB. Timeout: 120 s. No Bedrock calls.
  • dispatch — EventBridge rule on the three state events. Resolves owner, checks quiet hours and holiday calendar (with a per-job critical flag that can override both for escalations), formats the alert from the voice template, and ships via Slack chat.postMessage (bk/slack/bot-token in Secrets Manager) or SES SendRawEmail. On quiet-hours or holiday defer, creates a one-off EventBridge Scheduler rule that re-invokes dispatch at the next available business minute. Updates the job’s row in bk-state after a successful send. Memory: 256 MB. Timeout: 30 s.
  • ack-handler — Lambda Function URL, public with AuthType: NONE; verifies a Slack signature on the request body. Triggered by Slack interactive button clicks (Mark-fixed/Snooze/Mute) and by email-link clicks. Writes to bk-state and bk-audit; mark-fixed sets pending-fixed (the next checker run confirms a real healthy run before it flips to all-green), snooze writes snooze_until, mute writes muted: true. Memory: 256 MB. Timeout: 15 s.
  • summary — EventBridge Scheduler target, daily at 8am local. Reads the current bk-state across all jobs and the past day of bk-audit; calls Bedrock Haiku 4.5 to write a one-paragraph “all green / here’s what’s wrong” narrative, plus a per-job status line; posts to a configured Slack channel and emails via SES. Memory: 512 MB.

Storage

  • DynamoDB · bk-state — one row per job, current state. PK job_id; attributes: state (all_green/warn/alert/escalate/pending_fixed), since, last_evidence_ts, last_size, prev_size, snooze_until, muted, last_dispatched_state. On-demand. No TTL.
  • DynamoDB · bk-audit — one row per write action of any kind (state change, mark-fixed, snooze, mute, register). PK (job_id, ts); attributes: action, by_user, before, after. On-demand. No TTL — this is the long-term audit trail.
  • DynamoDB · bk-heartbeats — one row per heartbeat check-in. PK job_id; sort key ts; attributes: reported_size, source_ip. On-demand. TTL at 90 days — the checker only needs the most recent rows.
  • S3 · bk-registry-source — mirrored CSV from the Drive job-list sheet. Versioning enabled. Lifecycle to Glacier at 90 days; expiry at 7 years.
  • S3 · bk-rules-source — mirrored rules and voice docs as plain text. Versioning enabled.
  • S3 · bk-raw-mime — raw inbound MIME from forwarded backup reports. Lifecycle to Glacier at 30 days; expiry at 1 year.
  • Watched targets — the sentinel only holds s3:GetObject + s3:ListBucket (read-only) on the buckets where backups land, granted via resource policies on those buckets. It never writes to or deletes from them.

Bedrock

  • Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. Two callsites: intake-ses-parser for proposing a job row from a forwarded report, and summary for the daily plain-English narrative. No Sonnet path is justified here — both tasks are short and structured, and Haiku 4.5 handles them well within budget.
  • Embeddings. Not used. The job list is structured rows; deterministic lookup beats vector retrieval here. No Knowledge Base, no S3 Vectors, no Titan embeddings.
  • Quotas. Default account quotas are more than enough at SMB volume. The checker itself doesn’t call Bedrock; the parsing lane fires a few times a month and the summary once a day.

EventBridge Scheduler config

  • bk-hourly-checkrate(1 hour). Target: checker Lambda (jobs whose cadence is daily or faster).
  • bk-slow-checkcron(0 9 * * ? *) in TZ. Target: checker Lambda in slow mode (weekly/monthly jobs only).
  • bk-drive-syncrate(15 minutes). Target: drive-sync Lambda.
  • bk-daily-summarycron(0 8 * * ? *) in TZ. Target: summary Lambda.
  • One-off rules — created on the fly by dispatch when a quiet-hours or holiday defer is needed. Use at(YYYY-MM-DDTHH:MM:SS) expressions with --action-after-completion DELETE so the rule self-cleans.

SES inbound and outbound

  • Set the MX record on a dedicated subdomain (e.g. backups.your-company.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
  • SES inbound rule set bk-inbound-rules: one rule with recipient backups@your-company.com → spam scan → S3 PUT to s3://bk-raw-mime/<message-id> → stop. The S3 PUT triggers intake-ses-parser.
  • SES outbound for the email-fallback alerts and daily summary: verify a sender identity at sentinel@your-company.com with DKIM and SPF on the parent domain. Out of sandbox by request.

IAM (least privilege per Lambda)

Each Lambda has its own role with policies scoped to exact ARNs. Sketch:

  • checker role: s3:GetObject on the registry, rules, and voice keys; s3:GetObject + s3:ListBucket (read-only) on each watched target bucket/prefix; dynamodb:Query + GetItem + PutItem on bk-state and bk-heartbeats; events:PutEvents on the default bus. No bedrock:*, and no write or delete on any watched bucket.
  • dispatch role: scheduler:CreateSchedule for the deferred-dispatch one-offs; secretsmanager:GetSecretValue on the Slack bot-token secret; ses:SendRawEmail from the verified sender identity; dynamodb:PutItem on bk-state; outbound network access to slack.com.
  • ack-handler role: dynamodb:PutItem + UpdateItem on bk-state and bk-audit; secretsmanager:GetSecretValue on the Slack signing-secret; dynamodb:Query for state lookup.
  • intake-ses-parser role: s3:GetObject on bk-raw-mime; bedrock:InvokeModel on the Haiku ARN; secretsmanager:GetSecretValue on the Slack bot-token.
  • drive-sync role: secretsmanager:GetSecretValue on the Google service-account secret; s3:PutObject on the registry and rules buckets; outbound network to www.googleapis.com.
  • heartbeat role: dynamodb:PutItem on bk-heartbeats; secretsmanager:GetSecretValue on the per-job HMAC keys; secretsmanager:GetSecretValue on the Slack bot-token for new-job proposals.

Slack interactive flow

Alert messages are posted via the chat.postMessage Web API with Block Kit blocks containing the action buttons. Button clicks are sent by Slack to the configured Interactivity request URL, which is the ack-handler Function URL. ack-handler verifies the Slack signing secret on the inbound request, parses the action_id (mark_fixed, snooze, mute), opens a modal if needed (Snooze opens a days modal; Mark-fixed and Mute are one-tap), and processes the response when the modal is submitted. The same handler serves the email-fallback links via a signed query token.

The Slack app needs chat:write, im:write, and the Interactivity URL configured. The bot token lives in Secrets Manager under bk/slack/bot-token. The signing secret is bk/slack/signing-secret.

Observability and cost gates

  • CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on "error" + "throttle" + "timeout" to a CloudWatch metric for alerting.
  • Alarms: checker Lambda failures > 0 in a day (the check is the one piece that has to run — a sentinel that silently stops checking is the exact failure it exists to prevent, so this alarm pages directly, not through the sentinel itself); dispatch failure rate > 1% in 24h; ack-handler signature-verification failures > 5/hour (might mean the Slack secret rotated).
  • Self-watch: the checker emits its own heartbeat to a CloudWatch metric on every run; a metric-absence alarm (no check in 90 minutes) pages the admin independently of the sentinel’s own alert path.
  • X-Ray: off by default. Not worth the cost at SMB volume.
  • AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic bk-cost-alarm subscribed to the on-call admin’s email and Slack.

Config and secrets

Service-account credentials for Drive and Sheets APIs live in Secrets Manager under bk/drive/sa. Slack bot token and signing secret under bk/slack/*. Per-job heartbeat HMAC keys under bk/heartbeat/keys. SES sender identity lives in IAM and the verified-domain config. The configured timezone, holiday list reference, quiet-hours window, default shrink threshold, and admin fallback owner all live in Parameter Store under /bk/config/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.

Deploy

GitHub Actions with OIDC into a deploy role (no long-lived keys) and AWS SAM. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for both bk-registry-source and bk-rules-source so a bad Drive edit can be rolled back in one click, grant the watched-bucket read access via resource policies in a separate stack so a target account’s removal can’t break the core, and version the EventBridge Scheduler timezone setting so you don’t accidentally start checking in UTC after a CI rotation. Total deployable surface: around seven Lambdas, three DDB tables, three S3 buckets owned by the sentinel, one EventBridge rule on the default bus (plus the Scheduler rules), one SES rule set, and one Budgets alarm.

That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.

All posts