Part 5 of 7 · Backup sentinel series · ~5 min read

How a backup status gets cleared

An alert lands in Sam’s Slack DM at 8:03am. The orders database backup is 30 hours late. There are three buttons: Mark fixed, Snooze, Mute. What happens when he taps one? The honest answer is “it depends on what he actually did.” This post walks through what each button does and how the job’s state and the audit trail stay in sync.

Key takeaways

  • Three actions per alert: mark fixed (clear, resume checks), snooze (quiet while you work), mute (job is paused).
  • Mark fixed only sticks if the next check actually finds a healthy run — you can’t mark a broken job green forever.
  • Each action updates the bk-state table and writes an audit row.
  • Snooze is bounded — you can only snooze a few times before the sentinel escalates anyway.
  • The buttons are a Slack interactive message backed by a Function URL.

Three actions on an alert

[Diagram: an alert in a Slack DM (job name, what failed, how late or how small, link to evidence) with three buttons: Mark fixed, Snooze, and Mute. Each button leads to its own branch. Mark fixed sets the job to pending-fixed in bk-state until the next check confirms a healthy run. Snooze opens a modal for the number of days (default 1) and suppresses alerts until the snooze ends. Mute stops checks and alerts for a job that is meant to be paused, while the row stays on the list. All three branches write a row to the bk-audit DynamoDB table (timestamp, job_id, action, by-user, notes).]
Fig 5. Three actions per alert, three different effects. Mark fixed clears it pending the next check’s confirmation. Snooze quiets the job while you work. Mute pauses a job that’s meant to be off. Every action writes to the audit trail.
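As a rough illustration of the first hop in the figure, here is a minimal sketch of how a Function URL handler might pull the tapped button out of Slack's request. Slack delivers interactions as a form-encoded body with a single JSON `payload` field, and Function URL events carry `body` and `isBase64Encoded`. The `action_id` values and the idea that the button's `value` carries the job id are assumptions for illustration; the series does not specify them.

```python
import base64
import json
from urllib.parse import parse_qs, urlencode

# Hypothetical action ids; the real button ids are not given in this series.
KNOWN_ACTIONS = {"mark_fixed", "snooze", "mute"}


def parse_button_tap(event: dict) -> dict:
    """Extract the action, job id, and user from a Function URL event."""
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")

    # Slack posts interactions as a form with one JSON-encoded "payload" field.
    form = parse_qs(body)
    payload = json.loads(form["payload"][0])

    tap = payload["actions"][0]
    if tap["action_id"] not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {tap['action_id']}")

    return {
        "action": tap["action_id"],
        "job_id": tap["value"],  # assumption: the button value holds the job id
        "by_user": payload["user"]["id"],
    }


# Example: a tap on Snooze for a hypothetical job id.
sample = {
    "type": "block_actions",
    "user": {"id": "U123"},
    "actions": [{"action_id": "snooze", "value": "orders-db"}],
}
event = {"body": urlencode({"payload": json.dumps(sample)}), "isBase64Encoded": False}
tap = parse_button_tap(event)
```

A production handler would also verify Slack's request signature before trusting the body; that step is left out here to keep the sketch short.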

Action 1: mark fixed (the most common)

Sam cleared the full disk and re-ran the dump. A fresh 415 MB file landed. He taps Mark fixed. A Function URL Lambda sets the job’s status in bk-state to pending-fixed and writes a fixed row to bk-audit with his name and the timestamp. The Slack message updates to “Marked fixed by Sam — confirming on next check.”

Here’s the important part: mark fixed doesn’t turn the job green on its own. The very next scheduled check has to actually find a healthy run — finished, recent, right size — before the status flips fully to all green. This is deliberate. If somebody taps the button out of optimism (“I re-ran it, should be fine”) but the job is still broken, the next check sees it’s still broken and the alert comes right back, now noting “marked fixed at 8:05am but still failing.” You can’t silence a broken backup by clicking a button — only a real, verified run clears it. That one rule is what keeps the sentinel honest.
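The two-step rule above can be sketched as a pair of pure functions: one for the button tap, one for the next scheduled check. This is a sketch under assumptions, not the sentinel's actual code: the row is modelled as a plain dict rather than a DynamoDB item, the `failing` status name, the `marked_fixed_*` fields, and the `max_age` and `min_size_bytes` thresholds are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass
class RunInfo:
    """What the checker learned about the latest backup run (hypothetical shape)."""
    finished: bool
    finished_at: datetime
    size_bytes: int


def is_healthy(run: RunInfo, now: datetime, max_age: timedelta, min_size_bytes: int) -> bool:
    # Healthy means finished, recent, and the right size.
    return run.finished and (now - run.finished_at) <= max_age and run.size_bytes >= min_size_bytes


def mark_fixed(state: dict, by_user: str, now: datetime) -> dict:
    """Button tap: record the claim, but do not turn the job green."""
    return {
        **state,
        "status": "pending-fixed",
        "marked_fixed_by": by_user,
        "marked_fixed_at": now.isoformat(),
    }


def confirm_on_check(
    state: dict, run: RunInfo, now: datetime, max_age: timedelta, min_size_bytes: int
) -> Tuple[dict, Optional[str]]:
    """Next scheduled check: return the new state and an alert note (None if green)."""
    cleared = {k: v for k, v in state.items() if k not in ("marked_fixed_by", "marked_fixed_at")}
    if is_healthy(run, now, max_age, min_size_bytes):
        return {**cleared, "status": "green"}, None
    if state.get("status") == "pending-fixed":
        note = f"marked fixed at {state['marked_fixed_at']} but still failing"
        return {**cleared, "status": "failing"}, note
    return {**state, "status": "failing"}, "still failing"
```

The point of splitting it this way is that `mark_fixed` never writes `green` itself; only `confirm_on_check` can, and only after looking at a real run.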

Action 2: snooze (the deferral)

Some fixes take longer than the alert wants to wait. The vendor whose tool runs the backup is slow to respond. The disk replacement is on order. The person who knows the script is out until Thursday. Sam isn’t able to fix it right now, but he’s on it — he just needs the sentinel to be quiet about this one job for a day.

Snooze opens a small modal asking for the number of days, with a 1-day default and a max of 7. On save, a snooze_until value is written to the job’s row in bk-state. The next check reads that in the “muted or snoozed?” step from Part 3 and keeps the job quiet until the snooze ends — but it keeps checking, so the moment the job recovers on its own, the state flips to green without anyone tapping anything. When the snooze ends, the checker re-evaluates — if the job is still broken, the next alert may be an escalation.

Snooze is bounded. The rules doc has a configurable max_snoozes_per_break setting (default three). After that many snoozes on the same break, further snooze attempts are rejected with a “You’ve hit the snooze cap on this job; please fix or escalate” reply, and the next check alerts normally regardless. This exists because the most dangerous failure mode is snoozing a dead backup week after week, right up until the day you need to restore from it.
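The snooze rules above (1-day default, 7-day maximum, capped by max_snoozes_per_break) can be sketched as follows. This is an illustration only: the row is a plain dict, and the `snooze_count` field used to track the cap is an assumed name that does not appear in the series.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_SNOOZE_DAYS = 1
MAX_SNOOZE_DAYS = 7
DEFAULT_MAX_SNOOZES_PER_BREAK = 3


class SnoozeRejected(Exception):
    """Raised when the job has already used up its snoozes for this break."""


def apply_snooze(
    state: dict,
    now: datetime,
    days: int = DEFAULT_SNOOZE_DAYS,
    max_snoozes_per_break: int = DEFAULT_MAX_SNOOZES_PER_BREAK,
) -> dict:
    """Return the row with snooze_until set, or raise if the cap is reached."""
    if not 1 <= days <= MAX_SNOOZE_DAYS:
        raise ValueError(f"days must be between 1 and {MAX_SNOOZE_DAYS}")

    used = state.get("snooze_count", 0)  # hypothetical counter field
    if used >= max_snoozes_per_break:
        raise SnoozeRejected("You've hit the snooze cap on this job; please fix or escalate")

    return {
        **state,
        "snooze_until": (now + timedelta(days=days)).isoformat(),
        "snooze_count": used + 1,
    }


def alerts_suppressed(state: dict, now: datetime) -> bool:
    """True while an unexpired snooze is in place. Checks still run either way."""
    until = state.get("snooze_until")
    return until is not None and now < datetime.fromisoformat(until)


def clear_break(state: dict) -> dict:
    """On recovery, drop the snooze fields so the next break starts with a fresh cap."""
    return {k: v for k, v in state.items() if k not in ("snooze_until", "snooze_count")}
```

Because the counter is only reset by `clear_break`, a job that never recovers runs out of snoozes, which is the behaviour the cap is there to enforce.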

Action 3: mute (the “this one is supposed to be off”)

Sometimes a job is failing because it’s meant to be paused. The seasonal store you only back up in Q4. The old database you decommissioned last week but left on the list. The export you turned off on purpose while you migrate. The owner doesn’t want to fix it and doesn’t want to be reminded — it’s correct that it isn’t running.

Mute writes a muted: true flag to the job’s row in bk-state. The checker skips muted jobs entirely — no tests, no alerts — but the row stays on the list so it isn’t forgotten. A muted job shows up in the daily summary with a small “muted by Sam on 2026-06-10” note, so a paused backup never quietly becomes a forgotten backup. When the season comes around or the migration finishes, the owner un-mutes it and normal checks resume on the next run.

The difference between mute and snooze matters. Snooze says “this should be working, give me time.” Mute says “this is correctly not working, leave it alone.” Keeping them separate means a snoozed job always comes back and alerts again if you forget it, while a muted job is an explicit, logged decision you can see in the summary every day.
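One way to picture the distinction is as the checker's “muted or snoozed?” gate. The sketch below is a guess at its shape, not the Part 3 implementation: the `muted_by` and `muted_on` fields and the returned plan keys are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Optional


def plan_check(state: dict, now: datetime) -> dict:
    """Decide whether to run the tests and whether an alert may be sent."""
    if state.get("muted"):
        # Muted: skip the job entirely, but keep it visible in the daily summary.
        note: Optional[str] = f"muted by {state.get('muted_by', 'unknown')} on {state.get('muted_on', 'unknown date')}"
        return {"run_tests": False, "may_alert": False, "summary_note": note}

    until = state.get("snooze_until")
    if until is not None and now < datetime.fromisoformat(until):
        # Snoozed: keep checking quietly so a recovery is noticed right away.
        return {"run_tests": True, "may_alert": False, "summary_note": None}

    # Neither muted nor snoozed (or the snooze has expired): check and alert as normal.
    return {"run_tests": True, "may_alert": True, "summary_note": None}
```

The only difference between the two quiet paths is `run_tests`: a snoozed job is still watched, a muted job is not.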

Every action is logged, every action is reversible

The bk-audit table records every mark-fixed, snooze, and mute with the user who took the action, the timestamp, and a snapshot of the job’s state before and after. If a job gets muted by mistake (somebody thought it was decommissioned, but it wasn’t), an admin can run an “undo last action” command that reads the previous-state snapshot and restores it. The undo is itself written as an audit row, so the trail of changes stays complete.
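A minimal sketch of the snapshot-and-undo idea, with an in-memory dict and list standing in for the bk-state and bk-audit tables. The function names and the exact row layout are assumptions for illustration; the real admin command is not shown in this series.

```python
import copy
from datetime import datetime, timezone


def record_action(states, audit, job_id, action, by_user, new_state, now, notes=""):
    """Apply new_state to the job and append an audit row with before/after snapshots."""
    before = copy.deepcopy(states.get(job_id))
    states[job_id] = copy.deepcopy(new_state)
    audit.append({
        "timestamp": now.isoformat(),
        "job_id": job_id,
        "action": action,
        "by_user": by_user,
        "notes": notes,
        "before": before,
        "after": copy.deepcopy(new_state),
    })


def undo_last_action(states, audit, job_id, by_user, now):
    """Restore the 'before' snapshot of the job's most recent non-undo action."""
    last = next(
        (row for row in reversed(audit) if row["job_id"] == job_id and row["action"] != "undo"),
        None,
    )
    if last is None:
        raise LookupError(f"no action to undo for {job_id}")
    # The undo goes through record_action, so it leaves its own audit row.
    record_action(
        states, audit, job_id, "undo", by_user, last["before"], now,
        notes=f"undo of {last['action']} at {last['timestamp']}",
    )
```

This simplified version only ever reverses the single most recent action on a job; walking further back through the history would need extra bookkeeping.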

This kind of reversibility matters most for the decisions you’ll only think about once. The next time that decommissioned database turns out to have been needed after all, the audit trail is the only record of who muted it and when. A backup monitor that lets things disappear silently is just a quieter version of the problem it’s meant to solve.

Next post: the cost breakdown. The whole pipeline above runs in coffee-money territory at SMB volume; Part 6 explains exactly where the dollars go and why it’s one of the cheapest systems in the series.
