What is a backup sentinel?

A small serverless system that makes sure your backups actually worked. It watches each backup job you point it at (databases, file stores, cloud drives) and, on a schedule, confirms that each one finished, is recent enough, and is the right size. The moment a job is missing, stale, or suspiciously small, it warns the owner. It only watches and warns; it never deletes or restores anything. Right from the alert, the owner can mark a job fixed, snooze it while they work on it, or mute a job that is meant to be paused.
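The three tests named above (finished, recent enough, right size) can be sketched in a few lines. This is a minimal illustration, not the series' actual checker: the `Evidence` shape, the `check_job` name, and the threshold parameters are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Evidence:
    """Latest evidence a backup job left behind (hypothetical shape)."""
    finished_at: Optional[datetime]  # None when no completed run was found
    size_bytes: int


def check_job(evidence: Evidence, max_age: timedelta,
              min_size_bytes: int, now: datetime) -> list:
    """Return the problems found; an empty list means the job looks healthy."""
    if evidence.finished_at is None:
        # Without a finished run there is nothing else to test.
        return ["missing"]
    problems = []
    if now - evidence.finished_at > max_age:
        problems.append("stale")
    if evidence.size_bytes < min_size_bytes:
        problems.append("shrank")
    return problems
```

The function only reads the evidence it is handed, which matches the watch-and-warn rule: nothing here could modify a backup.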

How much does it cost to run?

About $1.50/month at typical small-business volume (around 30 backup jobs checked a few times a day). The fixed cost is essentially zero. The variable cost is dominated by the scheduled Lambda check that reads each job's latest report; Bedrock runs only for the plain-English daily summary, so it is a small sliver of the bill. At 200 jobs checked hourly, the bill lands around $10/month.

Which AWS services does it use?

Lambda (Python 3.14, arm64) with a Function URL for the heartbeat endpoint and the clear/snooze buttons, EventBridge Scheduler for the check schedule, DynamoDB on-demand, S3 (with versioning), SES inbound + outbound, Secrets Manager, CloudWatch Logs (7-day retention), AWS Budgets, and Bedrock (Claude Haiku 4.5 via Global cross-Region inference) for the daily plain-English summary only. No API Gateway, no NAT Gateway, no always-on compute, no Knowledge Base.

Where does the job list live?

In a Google Sheet in a Drive folder you control. One row per job with name, what it backs up, owner email, where the backup lands, how often it should run, the smallest size you would expect, and how late is too late. A small drive-sync Lambda mirrors the sheet to S3 every 15 minutes; the sentinel reads from S3 to keep Drive API calls predictable and to get S3 versioning for free.
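As a rough sketch of what one row of that sheet might look like once mirrored, the snippet below parses a CSV export into typed rows. The column names, the units (hours, megabytes), and the assumption that the mirror is stored as CSV are all illustrative; the series does not specify them here.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRow:
    name: str
    backs_up: str
    owner_email: str
    location: str           # where the backup lands
    every_hours: float      # how often the job should run
    min_size_mb: float      # smallest size you would expect
    max_late_hours: float   # how late is too late


def parse_jobs(csv_text: str) -> list:
    """Parse the mirrored sheet (assumed here to be a CSV export)."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        rows.append(JobRow(
            name=rec["name"].strip(),
            backs_up=rec["backs_up"].strip(),
            owner_email=rec["owner_email"].strip().lower(),
            location=rec["location"].strip(),
            every_hours=float(rec["every_hours"]),
            min_size_mb=float(rec["min_size_mb"]),
            max_late_hours=float(rec["max_late_hours"]),
        ))
    return rows
```

Converting the numeric columns up front means a typo in the sheet fails loudly at sync time rather than silently during a check.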

Does the sentinel touch my backups?

No. It only reads — it looks at the report each backup leaves behind, or the marker file in the backup folder, or a heartbeat ping the job sends when it finishes. It never deletes, moves, or restores a backup. The worst it can do is send an alert. That read-only rule is the whole point: a watcher that could change your backups would be one more thing that could break them.
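To make the heartbeat option concrete, here is a minimal sketch of a handler behind a Lambda Function URL. The `job` and `token` query parameters, the in-memory `TOKENS` map, and the response bodies are assumptions for illustration; how the series actually authenticates and stores heartbeats is not stated in this page.

```python
import hmac
from datetime import datetime, timezone

# Hypothetical per-job tokens, hard-coded only to keep the sketch self-contained.
TOKENS = {"nightly-db": "example-token"}


def handler(event, context=None, tokens=TOKENS):
    """Accept a heartbeat ping; never touches the backup itself."""
    # Function URL events omit queryStringParameters when there is no query string.
    params = event.get("queryStringParameters") or {}
    job = params.get("job", "")
    token = params.get("token", "")
    expected = tokens.get(job)
    # compare_digest avoids leaking the token through timing differences.
    if expected is None or not hmac.compare_digest(token.encode(), expected.encode()):
        return {"statusCode": 403, "body": "forbidden"}
    received_at = datetime.now(timezone.utc).isoformat()
    # A real handler would record (job, received_at) as evidence here.
    return {"statusCode": 200, "body": f"ok {job} {received_at}"}
```

The handler's only side effect would be recording that a ping arrived, so it stays inside the read-only rule with respect to the backups themselves.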

How does an alert reach me without being noisy?

Every check assigns each job one of four states: all green, warn, alert, or escalate. Only a state change pings you; a job that is still broken tomorrow does not re-ping unless it has crossed into escalation. Pings respect quiet hours (default 6pm–8am local) and the holiday calendar. A cleared job stops pinging until it breaks again. Snooze quiets a job while you fix it; you get a few snoozes before the sentinel escalates anyway.
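The ping rule can be sketched as a small decision function. The function names and the "none" / "hold" / "send" return values are hypothetical, and holding a ping until quiet hours end (rather than dropping it) is an assumption; the text only says pings respect quiet hours and holidays.

```python
from datetime import date, datetime, time

QUIET_START = time(18, 0)  # 6pm, the default quoted above
QUIET_END = time(8, 0)     # 8am


def in_quiet_hours(t: time, start: time = QUIET_START, end: time = QUIET_END) -> bool:
    """True inside the quiet window, which may wrap past midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def ping_decision(previous: str, current: str, local_now: datetime,
                  holidays: frozenset = frozenset()) -> str:
    """Return "none", "hold", or "send" for one job after one check."""
    if current == previous:
        return "none"  # still in the same state: no re-ping
    if local_now.date() in holidays or in_quiet_hours(local_now.time()):
        return "hold"  # assumption: deliver once quiet hours or the holiday end
    return "send"
```

Because moving from alert to escalate is itself a state change, the "unless it crossed into escalation" case needs no special handling in this sketch.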

What happens when I act on an alert?

Three buttons on every alert: Mark fixed (the job ran again and looks healthy; clear the alert and resume normal checks), Snooze (quiet this job for N days, default 1, while you work on it), and Mute (this job is meant to be paused; stop checking it until you turn it back on). Every action is recorded in the bk-audit DynamoDB table with timestamp, job, action, by-user, and a snapshot before and after, so the trail is auditable for years and nothing is silently lost.
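A sketch of what one audit record might contain, using the fields listed above. The attribute names, the action identifiers, and the choice to store snapshots as JSON strings are assumptions; the actual bk-audit schema is covered in the engineering reference, not here.

```python
import json
from datetime import datetime, timezone
from typing import Optional

ALLOWED_ACTIONS = {"mark_fixed", "snooze", "mute"}  # hypothetical identifiers


def audit_item(job: str, action: str, by_user: str, before: dict, after: dict,
               now: Optional[datetime] = None) -> dict:
    """Build one audit record; writing it to the table is left out of this sketch."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    ts = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "job": job,
        "timestamp": ts,
        "action": action,
        "by_user": by_user,
        # Snapshots are serialized so the record stays readable years later.
        "before": json.dumps(before, sort_keys=True),
        "after": json.dumps(after, sort_keys=True),
    }
```

Rejecting unknown actions at this point keeps the trail limited to the three buttons the alert actually offers.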

Series · 7 parts · Published June 10, 2026

Backup sentinel

A serverless watcher that makes sure your backups actually worked, because a backup you never checked isn’t a backup. It watches each backup job you point it at (databases, file stores, cloud drives), confirms each one finished, is recent, and is the right size, and warns the right owner the moment one is missing, stale, or suspiciously small. It only watches and warns; it never deletes or restores anything. Seven posts on the same system, one diagram at a time, with an engineering reference at the end.

01

A backup sentinel on AWS for a few dollars a month

The whole system on one page: a job intake, a checker, and a dispatcher, plus the four states they share for every backup job.

02

How a backup job gets registered

Three lanes feed the job list: the Drive sheet itself, an inbox-forwarding lane that turns emailed backup reports into proposed rows for one-tap approval, and a heartbeat lane where a job pings a private URL when it finishes.

03

How a backup gets checked

A scheduled check reads each job’s latest evidence, runs three plain tests (did it finish, is it recent, is it the right size), and picks one of four states: all green, warn, alert, or escalate. No model is involved in the check.

04

How a backup alert reaches its owner

Owner resolution per job, quiet hours, holiday calendars, Slack DMs with full context, email fallback, and the four guardrails between the checker’s chosen state and the actual ping landing.

05

How a backup status gets cleared

Three actions on the alert: mark fixed (the job ran clean again, resume checks), snooze (quiet it while you work, capped at a few per break), and mute (this job is meant to be paused). Every action is logged.

06

What the backup sentinel costs

About $1.50 a month at SMB volume. The checker runs a few times a day, calls no model during the check, and invokes Bedrock only for the plain-English daily summary.

07

Engineering reference: the backup sentinel architecture

Same system, drawn purely for engineers. Service names, resource identifiers, region, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, EventBridge Scheduler config, and the DynamoDB schemas.

