Engineering reference: the backup sentinel architecture
Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, EventBridge Scheduler config, the DynamoDB schemas, and the Slack interactive flow. Read alongside the previous six posts; this one’s the build sheet.
Region and account shape
Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock Global cross-Region inference, and EventBridge Scheduler are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup work at SMB volume — the failure mode for an SMB is missing a backup failure, not a regional outage. Worth noting the dependency loop: the sentinel watches your backups, so don’t host it in the same account or region as the systems it watches — run it in a dedicated account so a blast that takes out a workload doesn’t also blind the watcher. One AWS account dedicated to the sentinel keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system. Cross-account read access to the buckets it inspects is granted via narrowly-scoped resource policies, never by running the sentinel inside the watched account.
Topology
Lambda functions
All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.
drive-sync— EventBridge Scheduler target, fires every 15 minutes. Uses the Google Drive API + Sheets API (service-account credentials in Secrets Manager underbk/drive/sa) to export the job-list sheet as CSV and write tos3://bk-registry-source/jobs.csvonly if the sheet has changed since the last sync. Same pattern syncs the rules and voice docs tos3://bk-rules-source/. Memory: 256 MB. Timeout: 30 s.heartbeat— Lambda Function URL,AuthType: NONE; authenticates each check-in with a per-job HMAC key (stored underbk/heartbeat/keys) passed in the request, so a leaked URL alone can’t forge a heartbeat. Records each check-in tobk-heartbeatswith(job_id, ts, reported_size). The first time an unknownjob_idchecks in, posts a Slack interactive proposal to register it. Memory: 256 MB. Timeout: 15 s.intake-ses-parser— S3 PUT trigger ons3://bk-raw-mime/. Parses MIME, extracts the report body and any attached log text, and calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0viaglobal.anthropic.claude-haiku-4-5-20251001-v1:0) to propose a job row. Posts the proposal to Slack viachat.postMessagewith Approve/Edit/Discard buttons. Reports are short text emails — no Textract or OCR is needed; if a backup tool only emits a structured JSON/XML status file, the parser reads it directly and skips the model call. Memory: 256 MB. Timeout: 30 s.checker— EventBridge Scheduler target, on a schedule (defaultrate(1 hour); jobs with a weekly cadence are evaluated on a slower companion schedule to save cost). Readss3://bk-registry-source/jobs.csvand the rules and voice docs. For each row: gathers evidence (S3ListObjectsV2on the job’s target prefix for newest key + size, or latest row frombk-heartbeats), runs the three tests, reads last state frombk-state, decides on a state. Emits one event per job whose state changed:bk.warn,bk.alert, orbk.escalate, with the job context as the event payload. All-green jobs emit nothing. Memory: 512 MB. Timeout: 120 s. No Bedrock calls.dispatch— EventBridge rule on the three state events. Resolves owner, checks quiet hours and holiday calendar (with a per-jobcriticalflag that can override both for escalations), formats the alert from the voice template, and ships via Slackchat.postMessage(bk/slack/bot-tokenin Secrets Manager) or SESSendRawEmail. On quiet-hours or holiday defer, creates a one-off EventBridge Scheduler rule that re-invokesdispatchat the next available business minute. Updates the job’s row inbk-stateafter a successful send. Memory: 256 MB. Timeout: 30 s.ack-handler— Lambda Function URL, public withAuthType: NONE; verifies a Slack signature on the request body. Triggered by Slack interactive button clicks (Mark-fixed/Snooze/Mute) and by email-link clicks. Writes tobk-stateandbk-audit; mark-fixed setspending-fixed(the nextcheckerrun confirms a real healthy run before it flips to all-green), snooze writessnooze_until, mute writesmuted: true. Memory: 256 MB. Timeout: 15 s.summary— EventBridge Scheduler target, daily at 8am local. Reads the currentbk-stateacross all jobs and the past day ofbk-audit; calls Bedrock Haiku 4.5 to write a one-paragraph “all green / here’s what’s wrong” narrative, plus a per-job status line; posts to a configured Slack channel and emails via SES. Memory: 512 MB.
Storage
- DynamoDB ·
bk-state— one row per job, current state. PKjob_id; attributes:state(all_green/warn/alert/escalate/pending_fixed),since,last_evidence_ts,last_size,prev_size,snooze_until,muted,last_dispatched_state. On-demand. No TTL. - DynamoDB ·
bk-audit— one row per write action of any kind (state change, mark-fixed, snooze, mute, register). PK(job_id, ts); attributes:action,by_user,before,after. On-demand. No TTL — this is the long-term audit trail. - DynamoDB ·
bk-heartbeats— one row per heartbeat check-in. PKjob_id; sort keyts; attributes:reported_size,source_ip. On-demand. TTL at 90 days — the checker only needs the most recent rows. - S3 ·
bk-registry-source— mirrored CSV from the Drive job-list sheet. Versioning enabled. Lifecycle to Glacier at 90 days; expiry at 7 years. - S3 ·
bk-rules-source— mirrored rules and voice docs as plain text. Versioning enabled. - S3 ·
bk-raw-mime— raw inbound MIME from forwarded backup reports. Lifecycle to Glacier at 30 days; expiry at 1 year. - Watched targets — the sentinel only holds
s3:GetObject+s3:ListBucket(read-only) on the buckets where backups land, granted via resource policies on those buckets. It never writes to or deletes from them.
Bedrock
- Foundation model.
anthropic.claude-haiku-4-5-20251001-v1:0via the Global cross-Region inference profileglobal.anthropic.claude-haiku-4-5-20251001-v1:0. Two callsites:intake-ses-parserfor proposing a job row from a forwarded report, andsummaryfor the daily plain-English narrative. No Sonnet path is justified here — both tasks are short and structured, and Haiku 4.5 handles them well within budget. - Embeddings. Not used. The job list is structured rows; deterministic lookup beats vector retrieval here. No Knowledge Base, no S3 Vectors, no Titan embeddings.
- Quotas. Default account quotas are more than enough at SMB volume. The checker itself doesn’t call Bedrock; the parsing lane fires a few times a month and the summary once a day.
EventBridge Scheduler config
bk-hourly-check—rate(1 hour). Target:checkerLambda (jobs whose cadence is daily or faster).bk-slow-check—cron(0 9 * * ? *)in TZ. Target:checkerLambda in slow mode (weekly/monthly jobs only).bk-drive-sync—rate(15 minutes). Target:drive-syncLambda.bk-daily-summary—cron(0 8 * * ? *)in TZ. Target:summaryLambda.- One-off rules — created on the fly by
dispatchwhen a quiet-hours or holiday defer is needed. Useat(YYYY-MM-DDTHH:MM:SS)expressions with--action-after-completion DELETEso the rule self-cleans.
SES inbound and outbound
- Set the MX record on a dedicated subdomain (e.g.
backups.your-company.com) toinbound-smtp.ap-southeast-1.amazonaws.com. - SES inbound rule set
bk-inbound-rules: one rule with recipientbackups@your-company.com→ spam scan → S3 PUT tos3://bk-raw-mime/<message-id>→ stop. The S3 PUT triggersintake-ses-parser. - SES outbound for the email-fallback alerts and daily summary: verify a sender identity at
sentinel@your-company.comwith DKIM and SPF on the parent domain. Out of sandbox by request.
IAM (least privilege per Lambda)
Each Lambda has its own role with policies scoped to exact ARNs. Sketch:
- checker role:
s3:GetObjecton the registry, rules, and voice keys;s3:GetObject+s3:ListBucket(read-only) on each watched target bucket/prefix;dynamodb:Query+GetItem+PutItemonbk-stateandbk-heartbeats;events:PutEventson the default bus. Nobedrock:*, and no write or delete on any watched bucket. - dispatch role:
scheduler:CreateSchedulefor the deferred-dispatch one-offs;secretsmanager:GetSecretValueon the Slack bot-token secret;ses:SendRawEmailfrom the verified sender identity;dynamodb:PutItemonbk-state; outbound network access toslack.com. - ack-handler role:
dynamodb:PutItem+UpdateItemonbk-stateandbk-audit;secretsmanager:GetSecretValueon the Slack signing-secret;dynamodb:Queryfor state lookup. - intake-ses-parser role:
s3:GetObjectonbk-raw-mime;bedrock:InvokeModelon the Haiku ARN;secretsmanager:GetSecretValueon the Slack bot-token. - drive-sync role:
secretsmanager:GetSecretValueon the Google service-account secret;s3:PutObjecton the registry and rules buckets; outbound network towww.googleapis.com. - heartbeat role:
dynamodb:PutItemonbk-heartbeats;secretsmanager:GetSecretValueon the per-job HMAC keys;secretsmanager:GetSecretValueon the Slack bot-token for new-job proposals.
Slack interactive flow
Alert messages are posted via the chat.postMessage Web API with Block Kit blocks containing the action buttons. Button clicks are sent by Slack to the configured Interactivity request URL, which is the ack-handler Function URL. ack-handler verifies the Slack signing secret on the inbound request, parses the action_id (mark_fixed, snooze, mute), opens a modal if needed (Snooze opens a days modal; Mark-fixed and Mute are one-tap), and processes the response when the modal is submitted. The same handler serves the email-fallback links via a signed query token.
The Slack app needs chat:write, im:write, and the Interactivity URL configured. The bot token lives in Secrets Manager under bk/slack/bot-token. The signing secret is bk/slack/signing-secret.
Observability and cost gates
- CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on
"error"+"throttle"+"timeout"to a CloudWatch metric for alerting. - Alarms: checker Lambda failures > 0 in a day (the check is the one piece that has to run — a sentinel that silently stops checking is the exact failure it exists to prevent, so this alarm pages directly, not through the sentinel itself); dispatch failure rate > 1% in 24h; ack-handler signature-verification failures > 5/hour (might mean the Slack secret rotated).
- Self-watch: the checker emits its own heartbeat to a CloudWatch metric on every run; a metric-absence alarm (no check in 90 minutes) pages the admin independently of the sentinel’s own alert path.
- X-Ray: off by default. Not worth the cost at SMB volume.
- AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic
bk-cost-alarmsubscribed to the on-call admin’s email and Slack.
Config and secrets
Service-account credentials for Drive and Sheets APIs live in Secrets Manager under bk/drive/sa. Slack bot token and signing secret under bk/slack/*. Per-job heartbeat HMAC keys under bk/heartbeat/keys. SES sender identity lives in IAM and the verified-domain config. The configured timezone, holiday list reference, quiet-hours window, default shrink threshold, and admin fallback owner all live in Parameter Store under /bk/config/. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.
Deploy
GitHub Actions with OIDC into a deploy role (no long-lived keys) and AWS SAM. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for both bk-registry-source and bk-rules-source so a bad Drive edit can be rolled back in one click, grant the watched-bucket read access via resource policies in a separate stack so a target account’s removal can’t break the core, and version the EventBridge Scheduler timezone setting so you don’t accidentally start checking in UTC after a CI rotation. Total deployable surface: around seven Lambdas, three DDB tables, three S3 buckets owned by the sentinel, one EventBridge rule on the default bus (plus the Scheduler rules), one SES rule set, and one Budgets alarm.
That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.
All posts