Part 3 of 7 · Backup sentinel series

How a backup gets checked

On a schedule — hourly for jobs that run often, a few times a day for the rest — an EventBridge Scheduler rule fires the checker Lambda. The Lambda reads the job list, looks at one row at a time, gathers that job’s latest evidence, runs three plain tests, and decides whether to stay quiet or raise something — and if so, how loud. The whole decision is plain Python. No model. No vector search. Every threshold lives in the rules doc, where a rep can edit it without a deploy.

Key takeaways

  • The checker runs on a schedule via EventBridge Scheduler — hourly, or a few times a day per job.
  • Three tests per job: did it finish, is it recent enough, is it the right size. All thresholds live in the rules doc.
  • Four states per job, every check: all green, warn, alert, escalate.
  • DynamoDB holds the last state per job so only a real change pings the owner.
  • The checker itself never calls a model. The decision is entirely deterministic.

The decision flow, per job

[Figure: decision flow per job on every scheduled check.]

Input: a job from the list, with the row's name, where the backup lands, freshness window, owner, and last state from DynamoDB.

  • Step 1. Gather latest evidence: the newest file in the backup folder, the latest report, or the most recent heartbeat.
  • Step 2. Muted or snoozed? (Reads the bk-state table.) If yes, route to All green and do nothing this check. If no, continue.
  • Step 3. Run three tests: did it finish, is it recent inside the window, is it at least the minimum size without a sharp shrink.
  • Step 4. How bad is the worst failing test? None failing routes to All green. A job that was already broken on the previous check and has now had a fair chance to be fixed routes to Escalate, which pings the escalation target named in the rules doc in addition to the owner.
  • Step 5. Soft miss or hard miss? (Reads the bk-state table to see whether this is new.) A fresh soft miss (slightly late, a little small) routes to Warn; a fresh hard miss (missing, well past the window, far too small) routes to Alert.

Each terminal box (All green, Warn, Alert, Escalate) emits an event to EventBridge with the state and the job context. The rules doc holds every threshold and the checker's code only enforces them: change a window in the doc and the next check uses the new value.
Fig 3. The checker’s decision tree, per job, per scheduled check. Five steps decide which of four states applies. The rules doc holds every threshold; the checker only enforces them.
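The flow in the figure can be sketched as one small function. This is an illustration under assumptions, not the series' actual code: the names `State`, `Miss`, `decide`, and `had_fair_chance` are hypothetical, and the sketch assumes that any non-green previous state counts as "already broken".

```python
from enum import Enum


class State(Enum):
    GREEN = "all_green"
    WARN = "warn"
    ALERT = "alert"
    ESCALATE = "escalate"


class Miss(Enum):
    # Severity of one test result; higher value means worse.
    NONE = 0
    SOFT = 1
    HARD = 2


def decide(muted: bool, misses: list[Miss], previous: State,
           had_fair_chance: bool) -> State:
    """Return exactly one state for one job on one check."""
    # Step 2: a muted or snoozed job is treated as green for this check.
    if muted:
        return State.GREEN

    # Step 4: only the worst of the three test results matters.
    worst = max(misses, key=lambda m: m.value, default=Miss.NONE)
    if worst is Miss.NONE:
        return State.GREEN

    # Assumption: any non-green previous state means "already broken".
    already_broken = previous is not State.GREEN
    if already_broken and had_fair_chance:
        return State.ESCALATE

    # Step 5: a fresh miss is either soft (warn) or hard (alert).
    return State.WARN if worst is Miss.SOFT else State.ALERT
```

Because the function takes only plain values and touches no clock or network, the same inputs always give the same state, which is the property the post relies on.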

Three tests: finished, recent, right size

Every job faces the same three questions on every check. They’re simple on purpose — the value is in asking them every single day, not in any one being clever.

Did it finish? A backup that started and crashed halfway is not a backup. The checker looks for a clear success signal: a report that says “completed,” a marker file the job writes only at the end, or a heartbeat that arrived after the run began. A run that started but never produced an end signal is treated as not finished.
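A minimal sketch of the "did it finish" test, assuming the gathered evidence has already been collapsed into a small record. The `Evidence` class and its field names are hypothetical, chosen only to mirror the three success signals listed above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Evidence:
    run_started: datetime
    report_status: Optional[str] = None        # e.g. "completed"
    marker_file_present: bool = False          # written only at the end of a run
    last_heartbeat: Optional[datetime] = None  # most recent heartbeat, if any


def finished(ev: Evidence) -> bool:
    """True only when there is a clear success signal for this run."""
    if ev.report_status is not None and ev.report_status.strip().lower() == "completed":
        return True
    if ev.marker_file_present:
        return True
    # A heartbeat counts only if it arrived after the run began.
    return ev.last_heartbeat is not None and ev.last_heartbeat > ev.run_started
```

A run with a start time and no end signal falls through every branch and returns `False`, matching the "started but never finished" case.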

Is it recent? The newest good run has to fall inside the job’s freshness window from the rules doc. A nightly dump might allow 26 hours — a little slack past 24 so a job that runs at 2:05am instead of 2:00am doesn’t trip. A weekly export might allow 8 days. If the last good run is older than the window, the job is stale, even if it once worked perfectly.
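The freshness test is a single subtraction. The sketch below assumes the window is expressed in hours and that both timestamps share a timezone; the function name `is_recent` is illustrative.

```python
from datetime import datetime, timedelta, timezone


def is_recent(last_good_run: datetime, now: datetime, window_hours: float) -> bool:
    """True when the newest good run falls inside the freshness window."""
    return now - last_good_run <= timedelta(hours=window_hours)


# Nightly dump with a 26-hour window: a run that is a few minutes late is fine.
last_good = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
slightly_late_check = last_good + timedelta(hours=24, minutes=3)
stale_check = last_good + timedelta(hours=26, minutes=30)
```

Here `is_recent(last_good, slightly_late_check, 26)` holds, while the same call with `stale_check` does not. A weekly export with an 8-day allowance would pass `window_hours=8 * 24`.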

Is it the right size? Two parts. First, the latest backup must be at least the smallest size you’d expect — a 2 KB file where you expect 400 MB means the dump produced almost nothing. Second, it mustn’t have shrunk sharply from the run before — a backup that drops from 412 MB to 30 MB overnight usually means a table got dropped or a folder got emptied. The rules doc holds both the minimum size and the “how much shrink is suspicious” percentage (default: more than 50% smaller than the previous good run).
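Both parts of the size test fit in a few lines. In this sketch the function name, the parameter names, and the choice of 1 MB = 1024 × 1024 bytes are assumptions; the 50% default is the one quoted above.

```python
from typing import Optional

MB = 1024 * 1024  # assumption: binary megabytes


def size_ok(latest_bytes: int, previous_good_bytes: Optional[int],
            min_bytes: int, max_shrink_pct: float = 50.0) -> bool:
    """True when the backup meets the floor and did not shrink sharply."""
    # Part one: the absolute floor.
    if latest_bytes < min_bytes:
        return False
    # Part two: shrink relative to the previous good run, if there is one.
    if not previous_good_bytes:
        return True
    shrink_pct = (previous_good_bytes - latest_bytes) / previous_good_bytes * 100
    # "More than 50% smaller" is suspicious; exactly 50% still passes.
    return shrink_pct <= max_shrink_pct
```

With a 10 MB floor, a drop from 412 MB to 30 MB is a shrink of roughly 93% and fails the second part even though it clears the floor. A 2 KB file against a 400 MB floor fails the first part before the shrink is ever computed.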

Windows and sizes live in the doc, not the code

The rules doc has one short section per job, or per group of similar jobs. Each section names the thresholds in plain prose: “Orders database: should run every 24 hours, allow up to 26; expect at least 350 MB; flag if it shrinks more than 50%. Shared drive sync: every 24 hours, allow 30; expect at least 5 GB. Weekly accounting export: every 7 days, allow 8; expect at least 10 MB.” The checker’s code reads these numbers; it doesn’t hard-code any of them. Loosen a window or raise a size floor by editing the doc, and the next check uses the new value — no deploy.
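The post does not describe how the checker extracts numbers from the doc, so the following is only one hypothetical way to do it: a small regex parser for sentences shaped like the examples above. `JobRules`, `parse_rule`, the accepted phrasing, and the binary size units are all assumptions.

```python
import re
from dataclasses import dataclass

UNIT_BYTES = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}  # assumption: binary units


@dataclass(frozen=True)
class JobRules:
    name: str
    window_hours: int
    min_bytes: int
    max_shrink_pct: float = 50.0  # default shrink threshold from the post


def parse_rule(line: str) -> JobRules:
    """Parse one rules-doc sentence such as
    'Orders database: should run every 24 hours, allow up to 26; expect at least 350 MB'.
    """
    name, _, body = line.partition(":")
    every = re.search(r"every (\d+) (hours|days)", body)
    allow = re.search(r"allow (?:up to )?(\d+)", body)
    size = re.search(r"at least (\d+(?:\.\d+)?) (KB|MB|GB)", body)
    shrink = re.search(r"shrinks more than (\d+(?:\.\d+)?)%", body)
    if not (every and allow and size):
        raise ValueError(f"could not read thresholds from: {line!r}")
    # The allowance uses the same unit as the schedule (hours or days).
    hours_per_unit = 24 if every.group(2) == "days" else 1
    return JobRules(
        name=name.strip(),
        window_hours=int(allow.group(1)) * hours_per_unit,
        min_bytes=int(float(size.group(1)) * UNIT_BYTES[size.group(2)]),
        max_shrink_pct=float(shrink.group(1)) if shrink else 50.0,
    )
```

Whatever the real mechanism is, the point carries over: the thresholds arrive as data on every check, so editing the doc changes the next result without touching the code.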

Four states, always

Every job, every check, lands in exactly one of four buckets. The names are simple on purpose.

  • All green. All three tests pass — finished, recent, right size — or the job is muted or snoozed. Do nothing. Most jobs, most checks, are green.
  • Warn. A soft miss: slightly past the freshness window but still inside a comfortable margin, or a little smaller than usual but above the floor. Send a low-key heads-up so the owner can glance at it. Record the new state in the bk-state DynamoDB table.
  • Alert. A hard miss: missing entirely, well past the freshness window, or far below the size floor / a sharp shrink. Send a clear alert with exactly what failed and a link to the evidence. Record the state.
  • Escalate. The job was already failing on a previous check and has had a fair chance to be fixed (default: still broken at the next day’s first check), and nobody has marked it fixed or snoozed it. Send to the escalation target named in the rules doc — usually the owner’s manager — in addition to the owner. A broken backup that nobody’s touching is one of the few cases where pushing harder is the right answer.
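The default escalation rule ("still broken at the next day's first check") can be read as a calendar-date comparison. This sketch takes that reading as an assumption, along with the hypothetical names `escalation_due` and `acknowledged` and the premise that both timestamps are in the same timezone.

```python
from datetime import datetime


def escalation_due(failing_since: datetime, now: datetime, acknowledged: bool) -> bool:
    """True when a failing job has had a fair chance to be fixed.

    acknowledged means someone marked the job fixed or snoozed it.
    """
    if acknowledged:
        return False
    # Any check on a later calendar date than the first failure qualifies,
    # so the first check of the next day is the one that escalates.
    return now.date() > failing_since.date()
```

A job that first failed at 02:30 is not escalated by the 14:30 check the same day, but is escalated by the first check after midnight unless someone has acted on it.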

State that makes the decision quiet

The checker reads one DynamoDB table every run. bk-state holds the current state per job: (job_id, state, since, last_evidence_ts, last_size). With that one table, the decision is a few dozen lines of Python and zero magic. A given job with given evidence and a given window always produces the same state. And crucially, the dispatch in the next post only fires when the state changes — green-to-alert pings; alert-to-alert (still broken, not yet escalation time) stays silent. Re-running the check produces no extra pings, because the state in the table already reflects what the owner has seen.
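The "only a change pings" behavior can be sketched with a plain dict standing in for the bk-state table. The row fields follow the tuple above; the function name `apply_check`, the string state values, and the choice to treat a missing row as all green are assumptions. A real implementation would read and write DynamoDB instead of a dict.

```python
def apply_check(table: dict, job_id: str, new_state: str, now_iso: str,
                evidence_ts: str, size: int) -> bool:
    """Write the job's row and return True only when the state changed."""
    prev = table.get(job_id)
    # Assumption: a job with no row yet is treated as previously all green.
    prev_state = prev["state"] if prev else "all_green"
    changed = prev_state != new_state
    table[job_id] = {
        "job_id": job_id,
        "state": new_state,
        # "since" moves only when the state changes (or on the first write).
        "since": now_iso if (changed or prev is None) else prev["since"],
        "last_evidence_ts": evidence_ts,
        "last_size": size,
    }
    return changed
```

Calling it twice in a row with the same failing state returns `True` the first time and `False` the second, so a re-run of the check produces no second ping.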

A fix triggers an explicit reset: when the next check sees all three tests pass again, the state flips back to all green and a “cleared” row is written for audit. Part 5 covers what the owner’s buttons do to that state in detail.

Why the check uses no model

The checker could call a model to write a smarter alert, or to “judge” whether a missing backup is really a problem. It doesn’t. Two reasons. First, the check should be the one part of the system that is utterly predictable — if the rules doc says a job must run inside 26 hours and it didn’t, the alert fires. A model in that loop introduces variance the team can’t reason about, and a backup monitor that sometimes decides a failure is fine is worse than useless. Second, model calls cost money, and most jobs most days are green, so the call would be wasted nine times out of ten.

Bedrock fires in exactly one place — the daily plain-English summary in Part 6 — turning the day’s green/warn/alert states into a calm paragraph. Not on the check. The checker itself is plain Python that reads evidence and writes states.

Next post: how an alert finds the right person, how quiet hours and holidays are honored, and what the owner’s buttons actually do.
