How a backup gets checked

Key takeaways

The checker runs on a schedule via EventBridge Scheduler — hourly, or a few times a day per job.
Three tests per job: did it finish, is it recent enough, is it the right size. All thresholds live in the rules doc.
Four states per job, every check: all green, warn, alert, escalate.
DynamoDB holds the last state per job so only a real change pings the owner.
The checker itself never calls a model. The decision is entirely deterministic.

The decision flow, per job

Fig 3. The checker’s decision tree, per job, per scheduled check. Five steps decide which of four states applies. The rules doc holds every threshold; the checker only enforces them.

Three tests: finished, recent, right size

Every job faces the same three questions on every check. They’re simple on purpose — the value is in asking them every single day, not in any one being clever.

Did it finish? A backup that started and crashed halfway is not a backup. The checker looks for a clear success signal: a report that says “completed,” a marker file the job writes only at the end, or a heartbeat that arrived after the run began. A run that started but never produced an end signal is treated as not finished.

Is it recent? The newest good run has to fall inside the job’s freshness window from the rules doc. A nightly dump might allow 26 hours — a little slack past 24 so a job that runs at 2:05am instead of 2:00am doesn’t trip. A weekly export might allow 8 days. If the last good run is older than the window, the job is stale, even if it once worked perfectly.

Is it the right size? Two parts. First, the latest backup must be at least the smallest size you’d expect — a 2 KB file where you expect 400 MB means the dump produced almost nothing. Second, it mustn’t have shrunk sharply from the run before — a backup that drops from 412 MB to 30 MB overnight usually means a table got dropped or a folder got emptied. The rules doc holds both the minimum size and the “how much shrink is suspicious” percentage (default: more than 50% smaller than the previous good run).
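The three questions above can be sketched as three small predicates. This is a minimal illustration, not the checker's actual code: the Evidence shape, its field names, and the function names are all assumptions made for the example. Only the 50% default shrink threshold comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Evidence:
    """What the checker knows about a job's newest run (hypothetical shape)."""
    finished: bool                    # a clear end-of-run success signal was seen
    finished_at: Optional[datetime]   # when the newest good run completed
    size_bytes: int                   # size of the newest backup
    prev_size_bytes: Optional[int]    # size of the previous good run, if any


def did_finish(ev: Evidence) -> bool:
    # No end signal means not finished, even if the run started.
    return ev.finished and ev.finished_at is not None


def is_recent(ev: Evidence, now: datetime, window: timedelta) -> bool:
    # The newest good run must fall inside the job's freshness window.
    return ev.finished_at is not None and now - ev.finished_at <= window


def is_right_size(ev: Evidence, min_bytes: int, max_shrink: float = 0.50) -> bool:
    # Part one: at least the smallest size you'd expect.
    if ev.size_bytes < min_bytes:
        return False
    # Part two: not sharply smaller than the previous good run.
    if ev.prev_size_bytes:
        shrink = 1 - ev.size_bytes / ev.prev_size_bytes
        if shrink > max_shrink:
            return False
    return True
```

With these definitions, a nightly run that finished at 2:05am is still inside a 26-hour window at 4:00am the next day, and a drop from 412 MB to 30 MB fails the shrink test even when the size floor is set low enough to let 30 MB through.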

Windows and sizes live in the doc, not the code

The rules doc has one short section per job, or per group of similar jobs. Each section names the thresholds in plain prose: “Orders database: should run every 24 hours, allow up to 26; expect at least 350 MB; flag if it shrinks more than 50%. Shared drive sync: every 24 hours, allow 30; expect at least 5 GB. Weekly accounting export: every 7 days, allow 8; expect at least 10 MB.” The checker’s code reads these numbers; it doesn’t hard-code any of them. Loosen a window or raise a size floor by editing the doc, and the next check uses the new value — no deploy.
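One way to picture "the code reads these numbers" is a loader that runs at the start of every check. The sketch below assumes the prose sections have already been turned into a structured form. The JSON layout, the job ids, and the field names are hypothetical, and 5 GB is written as 5000 MB for simplicity. The 50% shrink default is the one named earlier in the post.

```python
import json
from dataclasses import dataclass

# Hypothetical structured form of the three example sections above.
RULES_JSON = """
{
  "orders-db": {"window_hours": 26, "min_mb": 350, "max_shrink_pct": 50},
  "shared-drive-sync": {"window_hours": 30, "min_mb": 5000},
  "weekly-accounting-export": {"window_hours": 192, "min_mb": 10}
}
"""


@dataclass(frozen=True)
class Thresholds:
    window_hours: int          # freshness window ("allow up to")
    min_mb: int                # size floor ("expect at least")
    max_shrink_pct: int = 50   # default when a section names no percentage


def load_thresholds(rules_text: str) -> dict[str, Thresholds]:
    # Parsed fresh on every check, so an edit applies on the next run.
    raw = json.loads(rules_text)
    return {job_id: Thresholds(**fields) for job_id, fields in raw.items()}


rules = load_thresholds(RULES_JSON)
print(rules["orders-db"].window_hours)
```

Because nothing is cached between runs in this sketch, loosening a window is just an edit to the source text; the code never changes.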

Four states, always

Every job, every check, lands in exactly one of four buckets. The names are simple on purpose.

All green. All three tests pass — finished, recent, right size — or the job is muted or snoozed. Do nothing. Most jobs, most checks, are green.
Warn. A soft miss: slightly late but not yet past a comfortable margin, or a little smaller than usual but above the floor. Send a low-key heads-up so the owner can glance at it. Record the new state in the bk-state DynamoDB table.
Alert. A hard miss: the backup is missing entirely, well past the freshness window, far below the size floor, or sharply smaller than the previous run. Send a clear alert with exactly what failed and a link to the evidence. Record the state.
Escalate. The job was already failing on a previous check and has had a fair chance to be fixed (default: still broken at the next day’s first check), and nobody has marked it fixed or snoozed it. Send to the escalation target named in the rules doc — usually the owner’s manager — in addition to the owner. A broken backup that nobody’s touching is one of the few cases where pushing harder is the right answer.
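The four buckets above reduce to a short, deterministic function. The sketch below is an illustration under assumptions: the parameter names are invented, the soft-miss and hard-miss flags are taken as already computed from the three tests, "already failing" is read as a previous state of alert or escalate, and escalation_due stands in for "still broken at the next day's first check".

```python
from enum import Enum


class State(str, Enum):
    GREEN = "all green"
    WARN = "warn"
    ALERT = "alert"
    ESCALATE = "escalate"


def decide(
    *,
    muted_or_snoozed: bool,
    hard_miss: bool,
    soft_miss: bool,
    previous: State,
    escalation_due: bool,
) -> State:
    # A muted or snoozed job is treated as green: do nothing.
    if muted_or_snoozed:
        return State.GREEN
    if hard_miss:
        # Already failing and past its fair chance to be fixed: push harder.
        if previous in (State.ALERT, State.ESCALATE) and escalation_due:
            return State.ESCALATE
        return State.ALERT
    if soft_miss:
        return State.WARN
    return State.GREEN
```

The same inputs always give the same state, which is the property the rest of the design leans on.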

State that makes the decision quiet

The checker reads one DynamoDB table every run. bk-state holds the current state per job: (job_id, state, since, last_evidence_ts, last_size). With that one table, the decision is a few dozen lines of Python and zero magic. A given job with given evidence and a given window always produces the same state. And crucially, the dispatch in the next post only fires when the state changes — green-to-alert pings; alert-to-alert (still broken, not yet escalation time) stays silent. Re-running the check produces no extra pings, because the state in the table already reflects what the owner has seen.
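The "only a change pings" rule can be sketched with a plain dict standing in for the bk-state table. This is an assumption-laden illustration: the function name is invented, the dict replaces real DynamoDB reads and writes, and the last_evidence_ts and last_size columns are left out to keep it short.

```python
def apply_state(table: dict, job_id: str, new_state: str, now: str) -> bool:
    """Record new_state for job_id; return True if the owner should be pinged."""
    row = table.get(job_id)
    changed = row is None or row["state"] != new_state
    if changed:
        # "since" moves only when the state itself changes.
        table[job_id] = {"job_id": job_id, "state": new_state, "since": now}
    # A first-ever green row is recorded but is not worth a ping.
    first_green = row is None and new_state == "all green"
    return changed and not first_green


# In-memory stand-in for the bk-state table, keyed by job_id.
bk_state: dict = {}
print(apply_state(bk_state, "orders-db", "alert", "2024-01-02T04:00Z"))
print(apply_state(bk_state, "orders-db", "alert", "2024-01-02T05:00Z"))
```

The second call returns False because the stored state already says alert, so re-running the check stays silent.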

A fix is an explicit reset: when the next check sees all three tests pass again, the state flips back to all green and a “cleared” row is written for audit. Part 5 covers in detail what the owner’s buttons do to that state.

Why the check uses no model

The checker could call a model to write a smarter alert, or to “judge” whether a missing backup is really a problem. It doesn’t, for two reasons. First, the check should be the one part of the system that is utterly predictable: if the rules doc says a job must run inside 26 hours and it didn’t, the alert fires. A model in that loop introduces variance the team can’t reason about, and a backup monitor that sometimes decides a failure is fine is worse than useless. Second, model calls cost money, and most jobs on most days are green, so most of those calls would buy nothing.

Bedrock is called in exactly one place: the daily plain-English summary in Part 6, which turns the day’s green/warn/alert states into a calm paragraph. It is never called on the check. The checker itself is plain Python that reads evidence and writes states.

Next post: how an alert finds the right person, how quiet hours and holidays are honored, and what the owner’s buttons actually do.
