A backup sentinel on AWS for a few dollars a month
Most small businesses set up backups once and never look at them again. The nightly database dump that started failing in March because a disk filled up. The file-server sync that has been copying an empty folder since somebody renamed a share. The cloud-drive export that quietly stopped when an access token expired. Each one looks fine from a distance — there’s a job, it’s scheduled, nobody touched it — right up until the morning you need to restore and find the last good copy is six months old. A backup you never checked isn’t a backup. This post walks through the design of a small watcher that checks all of them, every day, and warns the right person the moment one stops working.
Key takeaways
- Three sources for registered jobs: a Drive job list, an inbox forwarding lane, and a heartbeat lane.
- Every job ends in one of four states on each check: all green, warn, alert, or escalate.
- Three plain tests per job: did it finish, is it recent enough, is it the right size.
- Pings respect quiet hours and your holiday calendar. A cleared job stops pinging.
- It only watches and warns — it never deletes or restores. Designed on AWS for about $2/month.
The whole system on one page
Before any code, here’s the shape of what we’re designing.
What you set up once (the outside)
- Backup jobs. A Google Sheet in a Drive folder, one row per job: name, what it backs up (e.g. orders database, shared drive, QuickBooks export), owner email, where the backup lands (an S3 path, a folder, a bucket), how often it should run, the smallest size you’d expect, and how late is too late. You can fill it in once and forget it; new jobs can also enter via two other lanes covered in Part 2: an inbox-forwarding lane (forward the report your backup tool emails out and the sentinel proposes a row for one-tap approval) and a heartbeat lane (a job calls a private web address when it finishes, and the sentinel registers it the first time it hears from it).
- A rules folder. Two short Google Docs in a Drive folder. The rules doc covers the freshness window for each job (how recent the last good run has to be before the sentinel worries), plus the smallest size that counts as a real backup, the owner per job (or per group), the escalation target if the owner doesn’t act, the quiet hours, and any holiday calendars to skip. A nightly database dump might allow 26 hours; a weekly file export might allow 8 days. The voice doc holds one alert message template per kind of problem: what the Slack DM or email actually says when a job is missing, stale, or too small.
- Owners. The people responsible for each job. Each owner has a Slack member ID (so the alert is a DM, not a public ping) or, if Slack isn’t set up for them, an email address. Pings land with the job name, exactly what failed, how late or how small it is, a link to the evidence, and buttons to mark it fixed, snooze it, or mute it.
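One way to picture a row of the job sheet is as a small typed record. The sketch below is only an illustration: the column names (`backs_up`, `max_age_hours`, and so on), the `BackupJob` class, the `parse_job_row` helper, and the owner address are all hypothetical, and the real sheet may label its columns differently.

```python
from dataclasses import dataclass

# Hypothetical column headers; the real sheet may name these differently.
REQUIRED_COLUMNS = (
    "name", "backs_up", "owner_email", "destination",
    "frequency_hours", "min_size_mb", "max_age_hours",
)


@dataclass(frozen=True)
class BackupJob:
    name: str
    backs_up: str            # what the job backs up
    owner_email: str
    destination: str         # where the backup lands
    frequency_hours: float   # how often it should run
    min_size_mb: float       # smallest size you'd expect
    max_age_hours: float     # how late is too late


def parse_job_row(row: dict) -> BackupJob:
    """Turn one sheet row (header -> cell text) into a BackupJob."""
    missing = [c for c in REQUIRED_COLUMNS if not str(row.get(c, "")).strip()]
    if missing:
        raise ValueError(f"job row is missing: {', '.join(missing)}")
    return BackupJob(
        name=row["name"].strip(),
        backs_up=row["backs_up"].strip(),
        owner_email=row["owner_email"].strip(),
        destination=row["destination"].strip(),
        frequency_hours=float(row["frequency_hours"]),
        min_size_mb=float(row["min_size_mb"]),
        max_age_hours=float(row["max_age_hours"]),
    )


# The orders-database job used as the running example in this post.
orders = parse_job_row({
    "name": "Orders database",
    "backs_up": "orders database",
    "owner_email": "sam@acme.example",  # placeholder address
    "destination": "s3://acme-backups/orders/",
    "frequency_hours": "24",
    "min_size_mb": "400",
    "max_age_hours": "26",
})
```

Rejecting incomplete rows at intake keeps a half-filled sheet row from turning into a job that can never go green.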
What runs on every check (the inside)
- The job intake. Three sources feed the job list. The Drive sheet itself is the canonical store. New jobs can also be added via the inbox forwarding lane (forward the success-or-failure email your backup tool sends to backups@your-company.com; the sentinel reads it and drops a one-tap approval card in the team’s Slack to confirm before the row is added) and the heartbeat lane (a job sends a quick ping to a private web address each time it finishes; the first ping proposes a row).
- The checker. Runs on a schedule: hourly for jobs that run often, or a few times a day for the rest. Reads each job’s latest evidence: the report it left behind, the marker file in the backup folder, or the most recent heartbeat. For each job it runs three plain tests. Did it finish? (Was there a successful run at all?) Is it recent? (Is the last good run inside the freshness window?) Is it the right size? (Is the latest backup at least the smallest expected size, and has it avoided a sharp shrink from the run before?) Then it picks one of four states. All green: all three tests pass. Warn: a soft miss, such as slightly late or a little smaller than usual. Alert: a hard miss, such as missing, well past the window, or far too small. Escalate: still broken after the owner had a fair chance to act. The checker doesn’t call a model during a check; the test logic is plain Python.
- Dispatch. Reads the voice doc, formats the alert for the chosen state and problem, and sends it. Slack DMs go through the Slack API. Email goes through SES outbound. Both honor quiet hours (no pings between 6pm and 8am local by default) and the holiday calendar. Every dispatch writes a row in DynamoDB so the next check can tell whether the owner has acted. A daily summary writes one plain-English line per job — “all green” or “here’s what’s wrong” — so the owner gets a calm morning read even on quiet days.
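The three tests and four states described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the series' actual checker: the `Evidence` record, the `classify` function, and every threshold (`grace_hours`, `soft_shrink`, `hard_shrink`, `escalate_after_hours`) are assumptions chosen for the example, since the post does not say where "slightly late" ends and "well past the window" begins.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    GREEN = "all green"
    WARN = "warn"
    ALERT = "alert"
    ESCALATE = "escalate"


@dataclass
class Evidence:
    finished: bool                             # was there a successful run at all
    age_hours: Optional[float] = None          # age of the last good run
    size_mb: Optional[float] = None            # size of the latest backup
    prev_size_mb: Optional[float] = None       # size of the run before it
    hours_since_alert: Optional[float] = None  # None if the owner has not been alerted


def classify(ev: Evidence, window_hours: float, min_size_mb: float,
             grace_hours: float = 2.0,            # assumed soft-miss margin
             soft_shrink: float = 0.10,           # assumed "a little smaller"
             hard_shrink: float = 0.50,           # assumed "far too small"
             escalate_after_hours: float = 24.0,  # assumed "fair chance to act"
             ) -> State:
    hard = soft = False
    if not ev.finished or ev.age_hours is None or ev.size_mb is None:
        hard = True  # test 1: no successful run to look at
    else:
        # Test 2: is the last good run inside the freshness window?
        if ev.age_hours > window_hours + grace_hours:
            hard = True
        elif ev.age_hours > window_hours:
            soft = True
        # Test 3: is it big enough, and did it avoid a sharp shrink?
        if ev.size_mb < min_size_mb:
            hard = True
        elif ev.prev_size_mb:
            shrink = 1 - ev.size_mb / ev.prev_size_mb
            if shrink >= hard_shrink:
                hard = True
            elif shrink >= soft_shrink:
                soft = True
    if hard:
        overdue = (ev.hours_since_alert is not None
                   and ev.hours_since_alert >= escalate_after_hours)
        return State.ESCALATE if overdue else State.ALERT
    return State.WARN if soft else State.GREEN


# A nightly dump with a 26-hour window and a 400 MB floor.
stale = classify(Evidence(True, age_hours=30, size_mb=412), 26, 400)
fresh = classify(Evidence(True, age_hours=6, size_mb=415, prev_size_mb=412), 26, 400)
```

Keeping the thresholds as parameters means the values can come from the rules doc rather than from the code.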
In plain words
Your orders database is dumped to S3 every night at 2am by a small script. The sentinel knows this job should produce a file under s3://acme-backups/orders/ every 24 hours, at least 400 MB, owned by your ops lead Sam. One Tuesday the dump script dies halfway because the disk it writes to first is full. No file lands. At the 8am check the sentinel reads the folder, sees the newest file is from 2am yesterday (30 hours old, past the 26-hour window), and the state flips from all green to alert. Sam gets a Slack DM: “Orders database backup: no new file since Mon 2:04am, 30h old (window 26h). Last good file 412 MB. [link to folder]” with Mark fixed / Snooze / Mute buttons. Sam clears the disk, re-runs the dump, sees a fresh 415 MB file land, and taps Mark fixed. The next check sees the fresh file, confirms it, and the job is green again.
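The arithmetic behind that alert is small enough to show. The sketch below assumes a hypothetical template string standing in for an entry in the voice doc, a hypothetical `render_stale_alert` helper, and made-up calendar dates (a Monday and the following Tuesday); only the 2:04am file time, the 8am check, the 26-hour window, and the 412 MB size come from the example above.

```python
from datetime import datetime

# Hypothetical template; in the design above this text would live in the voice doc.
STALE_TEMPLATE = (
    "{job}: no new file since {since}, {age_h}h old (window {window_h}h). "
    "Last good file {size_mb} MB."
)


def render_stale_alert(job, last_good, now, window_h, size_mb,
                       template=STALE_TEMPLATE):
    """Fill the stale-backup template from the newest file's timestamp."""
    # Age of the newest good file, rounded to whole hours for readability.
    age_h = round((now - last_good).total_seconds() / 3600)
    return template.format(
        job=job,
        since=last_good.strftime("%a %H:%M"),
        age_h=age_h,
        window_h=window_h,
        size_mb=size_mb,
    )


message = render_stale_alert(
    "Orders database backup",
    last_good=datetime(2024, 1, 1, 2, 4),  # a Monday, 2:04am (illustrative date)
    now=datetime(2024, 1, 2, 8, 0),        # the Tuesday 8am check
    window_h=26,
    size_mb=412,
)
```

The file is 29 hours 56 minutes old at check time, which rounds to the 30 hours quoted in the message.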
The cost of running this is about $2 a month at SMB volume. The cost of not running it is the one morning you reach for a backup that has silently been failing since spring — and the newest copy you have is from before the thing you need to undo ever happened.
Design rules that shaped every decision
- It only reads. The sentinel never deletes, moves, or restores a backup — the worst it can do is send an alert.
- Four states, always. All green, warn, alert, escalate. There is no fifth.
- Three plain tests per job: did it finish, is it recent, is it the right size. No magic.
- Only a state change pings you. A still-broken job stays quiet until it crosses into escalation.
- The job list lives in Drive. Adding a job, changing an owner, or shifting a window doesn’t need a deploy.
- Every check and every action is logged. Audit a recovery next year and you can see every alert that went out.
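The "only a state change pings you" rule, combined with the default quiet hours, comes down to one small decision function. This is a sketch under assumptions: the `ping_decision` and `in_quiet_hours` names and the "skip" / "defer" / "send" return values are invented for illustration, a return to all green is assumed to be silent, and the holiday calendar is left out for brevity.

```python
from datetime import time

QUIET_START = time(18, 0)  # 6pm local, the default from the dispatch step
QUIET_END = time(8, 0)     # 8am local


def in_quiet_hours(now_local: time, start: time = QUIET_START,
                   end: time = QUIET_END) -> bool:
    """True if now_local falls in the quiet window (start inclusive, end exclusive)."""
    if start <= end:
        return start <= now_local < end
    # The window wraps past midnight, e.g. 18:00 to 08:00.
    return now_local >= start or now_local < end


def ping_decision(prev_state: str, new_state: str, now_local: time) -> str:
    """Return "skip", "defer", or "send" for one job on one check."""
    if new_state == prev_state:
        return "skip"   # still green, or still broken: stay quiet
    if new_state == "all green":
        return "skip"   # assumption: a cleared job sends no ping
    if in_quiet_hours(now_local):
        return "defer"  # hold the ping until quiet hours end
    return "send"
```

With the end of the window treated as exclusive, a check at exactly 8am is allowed to send, which matches the 8am alert in the worked example.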
Why this shape
Most teams “monitor” backups in one of three ways: they trust that the backup tool would email them on failure, they glance at a folder once in a while, or they assume green-once-means-green-forever. All three fail the same way. The failure email never arrives because the job died before it could send one. The glance happens the week everything’s fine and never the week it isn’t. And the assumption is just hope wearing a hard hat — a job that worked in January tells you nothing about June.
The setup above flips the default. Instead of waiting for a backup to announce its own failure, a small system looks at each backup every day and treats silence as a problem, not as good news. This is the “dead man’s switch” idea — a check that fires precisely when it stops hearing the all-clear. Alerts come with enough context that the owner knows what to fix without digging. They escalate cleanly when the owner is out. And they stop the moment somebody says “fixed.” The sentinel is invisible most days; visible only on the days a backup actually broke.
The next four posts walk through each piece in turn: how a backup job gets registered, how a backup gets checked, how a backup alert reaches its owner, and how a backup status gets cleared once it’s fixed. One diagram per post. A cost breakdown and a final engineering reference at the end.