Part 1 of 7 · Backup sentinel series · ~5 min read

A backup sentinel on AWS for a few dollars a month

Most small businesses set up backups once and never look at them again. The nightly database dump that started failing in March because a disk filled up. The file-server sync that has been copying an empty folder since somebody renamed a share. The cloud-drive export that quietly stopped when an access token expired. Each one looks fine from a distance — there’s a job, it’s scheduled, nobody touched it — right up until the morning you need to restore and find the last good copy is six months old. A backup you never checked isn’t a backup. This post walks through the design of a small watcher that checks all of them, every day, and warns the right person the moment one stops working.

Key takeaways

  • Three sources for registered jobs: a Drive job list, an inbox forwarding lane, and a heartbeat lane.
  • Every job ends in one of four states on each check: all green, warn, alert, or escalate.
  • Three plain tests per job: did it finish, is it recent enough, is it the right size.
  • Pings respect quiet hours and your holiday calendar. A cleared job stops pinging.
  • It only watches and warns — it never deletes or restores. Designed on AWS for about $2/month.

The whole system on one page

Before any code, here’s the shape of what we’re designing.

System architecture: three sources, three pieces inside AWS.

At the top, three external boxes sit in a row. Far left, "Backup jobs": a Google Drive job list naming each backup you run for a database, file store, or cloud drive, plus an inbox forwarding lane and a heartbeat lane that add new jobs to the list. Centre, "Rules and voice": a Drive folder with a rules doc covering per-job freshness windows, smallest expected size, owners, and the escalation chain, plus a voice doc with the alert message templates. Far right, "Owners": the team members responsible for each job; pings land in their Slack DMs or, if no Slack ID is set, their email.

Each box connects by an arrow to the AWS account container below. Backup jobs flow into AWS. Rules and voice feed in to ground every alert. Owners receive pings with the job name, what failed, how late or how small it is, a link to the evidence, and buttons to clear or snooze.

Inside the AWS account, three components sit in a row, mirroring the layout above. On the left, the Job intake receives jobs from each source, reads forwarded backup reports, and writes the cleaned job into the list. In the middle, the Checker runs on a schedule, reads each job's latest evidence, runs three tests (did it finish, is it recent, is it the right size), and picks one of four states: all green, warn, alert, or escalate. On the right, Dispatch sends the alert via Slack or email, respects quiet hours and the holiday calendar, and tracks the state per job. Internal arrows flow left to right. A note at the bottom reads: it only watches and warns; the sentinel never deletes or restores a backup.
Fig 1. Three sources outside, three pieces inside AWS. Jobs flow in from a Drive job list, an inbox forwarding lane, and a heartbeat lane. The Checker runs on a schedule and picks one of four states. Dispatch sends the right alert to the right person at the right time.

What you set up once (the outside)

  • Backup jobs. A Google Sheet in a Drive folder, one row per job: name, what it backs up (e.g. orders database, shared drive, QuickBooks export), owner email, where the backup lands (an S3 path, a folder, a bucket), how often it should run, the smallest size you’d expect, and how late is too late. You can fill it in once and forget it; new jobs can also enter via two other lanes covered in Part 2 — an inbox-forwarding lane (forward the report your backup tool emails out and the sentinel proposes a row for one-tap approval) and a heartbeat lane (a job calls a private web address when it finishes, and the sentinel registers it the first time it hears from it).
  • A rules folder. Two short Google Docs in a Drive folder. The rules doc covers the freshness window for each job — how recent the last good run has to be before the sentinel worries — plus the smallest size that counts as a real backup, the owner per job (or per group), the escalation target if the owner doesn’t act, the quiet hours, and any holiday calendars to skip. A nightly database dump might allow 26 hours; a weekly file export might allow 8 days. The voice doc holds one alert message template per kind of problem — what the Slack DM or email actually says when a job is missing, stale, or too small.
  • Owners. The people responsible for each job. Each owner has a Slack member ID (so the alert is a DM, not a public ping) or, if Slack isn’t set up for them, an email address. Pings land with the job name, exactly what failed, how late or how small it is, a link to the evidence, and buttons to mark it fixed, snooze it, or mute it.
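As a rough illustration of what one row of that job list might look like once it is read into code, here is a minimal sketch. The class name, field names, the example email address, and the reading of "400 MB" as 400 million bytes are all assumptions made for this example; the post itself only lists the columns in words.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class BackupJob:
    """One row of the job list. Field names are illustrative, not the real sheet headers."""
    name: str
    backs_up: str                 # e.g. "orders database"
    owner_email: str
    destination: str              # an S3 path, a folder, a bucket
    expected_every: timedelta     # how often the job should run
    min_size_bytes: int           # smallest size you'd expect
    freshness_window: timedelta   # how late is too late
    slack_member_id: Optional[str] = None

    def alert_channel(self) -> str:
        """Slack DM when a member ID is set, otherwise fall back to email."""
        return "slack" if self.slack_member_id else "email"


# Hypothetical row modelled on the orders-database example in this post.
orders = BackupJob(
    name="Orders database backup",
    backs_up="orders database",
    owner_email="sam@example.com",          # made-up address
    destination="s3://acme-backups/orders/",
    expected_every=timedelta(hours=24),
    min_size_bytes=400 * 1000 * 1000,       # assuming 1 MB = 10^6 bytes
    freshness_window=timedelta(hours=26),
)
print(orders.alert_channel())  # email (no Slack member ID on this row)
```

Keeping the row immutable is a design choice for the sketch: the sheet in Drive stays the source of truth, and the code only reads it.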

What runs on every check (the inside)

  • The job intake. Three sources feed the job list. The Drive sheet itself is the canonical store. New jobs can also be added via the inbox forwarding lane (forward the success-or-failure email your backup tool sends to backups@your-company.com; the sentinel reads it and drops a one-tap approval card in the team’s Slack to confirm before the row is added) and the heartbeat lane (a job sends a quick ping to a private web address each time it finishes; the first ping proposes a row).
  • The checker. Runs on a schedule: hourly for jobs that run often, a few times a day for the rest. It reads each job’s latest evidence: the report it left behind, the marker file in the backup folder, or the most recent heartbeat. For each job it runs three plain tests. Did it finish (was there a successful run at all)? Is it recent (is the last good run inside the freshness window)? Is it the right size (is the latest backup at least the smallest expected size, and has it held steady rather than shrinking sharply from the run before)? Then it picks one of four states. All green: all three tests pass. Warn: a soft miss, such as slightly late or a little smaller than usual. Alert: a hard miss, such as missing, well past the window, or far too small. Escalate: still broken after the owner had a fair chance to act. The checker never calls a model during a check; the test logic is plain Python.
  • Dispatch. Reads the voice doc, formats the alert for the chosen state and problem, and sends it. Slack DMs go through the Slack API. Email goes through SES outbound. Both honor quiet hours (no pings between 6pm and 8am local by default) and the holiday calendar. Every dispatch writes a row in DynamoDB so the next check can tell whether the owner has acted. A daily summary writes one plain-English line per job — “all green” or “here’s what’s wrong” — so the owner gets a calm morning read even on quiet days.
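The three tests and four states described above can be sketched in a few lines of plain Python. This is one possible reading, not the series' actual checker: the function name and every threshold below (`soft_late`, `soft_shrink`, `escalate_after`) are assumptions chosen for illustration, since the post does not say where "slightly late" ends and "well past the window" begins.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class State(Enum):
    GREEN = "all green"
    WARN = "warn"
    ALERT = "alert"
    ESCALATE = "escalate"


def pick_state(
    now: datetime,
    last_good_run: Optional[datetime],
    size_bytes: Optional[int],
    previous_size_bytes: Optional[int],
    window: timedelta,
    min_size_bytes: int,
    first_alerted_at: Optional[datetime] = None,
    soft_late: timedelta = timedelta(hours=2),       # assumed "slightly late" margin
    soft_shrink: float = 0.25,                       # assumed "a little smaller" ratio
    escalate_after: timedelta = timedelta(hours=4),  # assumed "fair chance to act"
) -> State:
    if last_good_run is None or size_bytes is None:
        # Test 1 failed: no successful run at all is a hard miss.
        hard, soft = True, False
    else:
        age = now - last_good_run
        # Test 2: is it recent? Past the window is soft, well past is hard.
        well_past = age > window + soft_late
        slightly_late = window < age <= window + soft_late
        # Test 3: is it the right size? Below the floor is hard,
        # a noticeable drop from the previous run is soft.
        too_small = size_bytes < min_size_bytes
        shrank = (
            previous_size_bytes is not None
            and size_bytes < previous_size_bytes * (1 - soft_shrink)
        )
        hard = well_past or too_small
        soft = slightly_late or shrank
    if hard:
        waited = first_alerted_at is not None and now - first_alerted_at >= escalate_after
        return State.ESCALATE if waited else State.ALERT
    return State.WARN if soft else State.GREEN


# The orders job at a Tuesday 8am check, newest file from Monday 2:04am.
state = pick_state(
    now=datetime(2024, 1, 2, 8, 0),
    last_good_run=datetime(2024, 1, 1, 2, 4),
    size_bytes=412_000_000,
    previous_size_bytes=410_000_000,
    window=timedelta(hours=26),
    min_size_bytes=400_000_000,
)
print(state.value)  # alert
```

With these assumed margins, a file 27 hours old would only warn, while the roughly 30-hour-old file in the example is a hard miss.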

In plain words

Your orders database is dumped to S3 every night at 2am by a small script. The sentinel knows this job should produce a file under s3://acme-backups/orders/ every 24 hours, at least 400 MB, owned by your ops lead Sam. One Tuesday the dump script dies halfway because the disk it writes to before uploading is full. No file lands. At the 8am check the sentinel reads the folder and sees the newest file is from 2am yesterday: 30 hours old, past the 26-hour window. The state flips from all green to alert. Sam gets a Slack DM: “Orders database backup: no new file since Mon 2:04am, now 30h old (window 26h). Last good file 412 MB. [link to folder]” with Mark fixed / Snooze / Mute buttons. Sam clears the disk, re-runs the dump, sees a fresh 415 MB file land, and taps Mark fixed. The next check sees the fresh file, confirms it, and the job is green again.
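The arithmetic in that story is easy to check by hand, and a short sketch makes it concrete. The calendar dates are made up (any Monday and Tuesday would do), `in_quiet_hours` is a hypothetical helper, and treating the default quiet hours as running from 6pm up to but not including 8am is an assumption about how the boundary is handled.

```python
from datetime import datetime, time, timedelta

last_good = datetime(2024, 1, 1, 2, 4)    # Mon 2:04am, the newest file
check_time = datetime(2024, 1, 2, 8, 0)   # Tue 8:00am check
window = timedelta(hours=26)

age = check_time - last_good              # 29 hours 56 minutes
overdue = age - window                    # 3 hours 56 minutes past the window

age_hours = round(age.total_seconds() / 3600)
overdue_hours = round(overdue.total_seconds() / 3600)
print(f"{age_hours}h old, {overdue_hours}h past the 26h window")
# 30h old, 4h past the 26h window


def in_quiet_hours(t: time, start: time = time(18, 0), end: time = time(8, 0)) -> bool:
    """True if t falls in [start, end), allowing the range to wrap past midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


# Under the assumed half-open range, the 8am check is allowed to ping right away.
print(in_quiet_hours(check_time.time()))  # False
print(in_quiet_hours(time(7, 59)))        # True
```

If the end of quiet hours were treated as inclusive instead, the same alert would simply wait for the next check.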

The cost of running this is about $2 a month at SMB volume. The cost of not running it is the one morning you reach for a backup that has silently been failing since spring — and the newest copy you have is from before the thing you need to undo ever happened.

Design rules that shaped every decision

  • It only reads. The sentinel never deletes, moves, or restores a backup — the worst it can do is send an alert.
  • Four states, always. All green, warn, alert, escalate. There is no fifth.
  • Three plain tests per job: did it finish, is it recent, is it the right size. No magic.
  • Only a state change pings you. A still-broken job stays quiet until it crosses into escalation.
  • The job list lives in Drive. Adding a job, changing an owner, or shifting a window doesn’t need a deploy.
  • Every check and every action is logged. Audit a recovery next year and you can see every alert that went out.
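The "only a state change pings you" rule is the one most easily shown in code. The sketch below is a simplified illustration: the function name is made up, the history is a plain list rather than the per-job rows the design keeps in DynamoDB, and the choice not to ping on a return to all green is an assumption. Snooze and mute are left out.

```python
from typing import List, Optional, Tuple

GREEN = "all green"


def pings_for(history: List[str]) -> List[Tuple[int, str]]:
    """Return (check_index, state) for each check that should send a ping.

    A ping goes out only when the state differs from the previous check
    and the new state is not all green (assumed: recovery is silent).
    """
    pings: List[Tuple[int, str]] = []
    previous: Optional[str] = None
    for index, state in enumerate(history):
        if state != previous and state != GREEN:
            pings.append((index, state))
        previous = state
    return pings


# Six consecutive checks of one job: it breaks, stays broken, escalates, recovers.
checks = ["all green", "alert", "alert", "alert", "escalate", "all green"]
print(pings_for(checks))  # [(1, 'alert'), (4, 'escalate')]
```

Three checks in a row see the same broken job, but the owner hears about it once, then once more when it escalates.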

Why this shape

Most teams “monitor” backups in one of three ways: they trust that the backup tool would email them on failure, they glance at a folder once in a while, or they assume green-once-means-green-forever. All three fail the same way. The failure email never arrives because the job died before it could send one. The glance happens the week everything’s fine and never the week it isn’t. And the assumption is just hope wearing a hard hat — a job that worked in January tells you nothing about June.

The setup above flips the default. Instead of waiting for a backup to announce its own failure, a small system looks at each backup every day and treats silence as a problem, not as good news. This is the “dead man’s switch” idea — a check that fires precisely when it stops hearing the all-clear. Alerts come with enough context that the owner knows what to fix without digging. They escalate cleanly when the owner is out. And they stop the moment somebody says “fixed.” The sentinel is invisible most days; visible only on the days a backup actually broke.
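To make the dead man's switch idea concrete, here is a minimal sketch for heartbeat-style evidence. The function name, the job names, and the dictionaries are hypothetical. The point it illustrates is that the loop walks the jobs you expect to hear from, not the messages that happened to arrive, so a job that says nothing still gets flagged.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def silent_jobs(
    last_heartbeat: Dict[str, datetime],
    windows: Dict[str, timedelta],
    now: datetime,
) -> List[str]:
    """Return the jobs whose all-clear has not been heard inside their window.

    A job with no heartbeat on record counts as silent too.
    """
    overdue = []
    for job, window in windows.items():
        seen = last_heartbeat.get(job)
        if seen is None or now - seen > window:
            overdue.append(job)
    return sorted(overdue)


now = datetime(2024, 1, 2, 8, 0)
windows = {
    "orders": timedelta(hours=26),        # nightly dump
    "file-export": timedelta(days=8),     # weekly export
    "drive-export": timedelta(hours=26),  # never reported in
}
last_heartbeat = {
    "orders": datetime(2024, 1, 1, 2, 4),
    "file-export": datetime(2024, 1, 1, 8, 0),
}
print(silent_jobs(last_heartbeat, windows, now))  # ['drive-export', 'orders']
```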

The next four posts walk through each piece in turn: how a backup job gets registered, how a backup gets checked, how a backup alert reaches its owner, and how a backup status gets cleared once it’s fixed. One diagram per post. A cost breakdown and a final engineering reference at the end.
