How a backup status gets cleared

Key takeaways

Three actions per alert: mark fixed (clear, resume checks), snooze (quiet while you work), mute (job is paused).
Mark fixed only sticks if the next check actually finds a healthy run — you can’t mark a broken job green forever.
Each action updates the bk-state table and writes an audit row.
Snooze is bounded: by default you get three snoozes per break, after which further snoozes are rejected and the sentinel alerts anyway.
The buttons are a Slack interactive message backed by a Function URL.

Three actions on an alert

Fig 5. Three actions per alert, three different effects. Mark fixed clears it pending the next check’s confirmation. Snooze quiets the job while you work. Mute pauses a job that’s meant to be off. Every action writes to the audit trail.

Action 1: mark fixed (the most common)

Sam cleared the full disk and re-ran the dump. A fresh 415 MB file landed. He taps Mark fixed. A Function URL Lambda sets the job’s status in bk-state to pending-fixed and writes a fixed row to bk-audit with his name and the timestamp. The Slack message updates to “Marked fixed by Sam — confirming on next check.”
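As a rough illustration of the first thing that Lambda has to do, here is a minimal sketch of pulling the action, job, and user out of a button click. It assumes the Function URL delivers Slack's form-encoded interactive payload in the event body, that the button's action_id names the action (for example mark_fixed), and that its value carries the job id. Those last two are conventions chosen for this example, not something the pipeline is known to use. Slack request signature verification is left out for brevity and would be required in a real handler.

```python
import base64
import json
from urllib.parse import parse_qs


def parse_slack_action(event):
    """Return (action_id, job_id, username) from a Function URL event.

    Assumes the button's action_id names the action and its value holds
    the job id (both hypothetical conventions for this sketch).
    """
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    # Slack posts interactive payloads form-encoded, with JSON in "payload".
    payload = json.loads(parse_qs(body)["payload"][0])
    action = payload["actions"][0]
    return action["action_id"], action["value"], payload["user"]["username"]
```

The returned tuple is all the handler needs to decide which row in bk-state to update and whose name goes in bk-audit.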

Here’s the important part: mark fixed doesn’t turn the job green on its own. The very next scheduled check has to actually find a healthy run (finished, recent, right size) before the status flips from pending-fixed to green. This is deliberate. If somebody taps the button out of optimism (“I re-ran it, should be fine”) but the job is still broken, the next check sees that and the alert comes right back, now noting “marked fixed at 8:05am but still failing.” You can’t silence a broken backup by clicking a button; only a real, verified run clears it. That one rule is what keeps the sentinel honest.
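The confirm-on-next-check rule can be sketched in a few lines. This is a minimal illustration, not the sentinel's actual code: plain dicts stand in for the bk-state and bk-audit tables, the marked_fixed_at field and both function names are made up for the example, and the check's health verdict is passed in as a boolean rather than computed.

```python
from datetime import datetime, timezone

# In-memory stand-ins for the bk-state and bk-audit tables (assumption).
bk_state = {"nightly-db": {"status": "failing"}}
bk_audit = []


def mark_fixed(job_id, user, now):
    """Record the click. The job becomes pending-fixed, not green."""
    row = bk_state[job_id]
    row["status"] = "pending-fixed"
    row["marked_fixed_at"] = now  # hypothetical field name
    bk_audit.append({"job": job_id, "action": "fixed", "user": user, "at": now})


def apply_check(job_id, run_is_healthy):
    """Apply the next scheduled check's verdict; return alert text or None."""
    row = bk_state[job_id]
    marked_at = row.pop("marked_fixed_at", None)
    if run_is_healthy:
        row["status"] = "green"  # only a verified run clears the alert
        return None
    alert = f"{job_id}: backup failing"
    if row["status"] == "pending-fixed" and marked_at is not None:
        alert += f" (marked fixed at {marked_at:%H:%M} but still failing)"
    row["status"] = "failing"
    return alert
```

Note that mark_fixed never writes green. The only path to green runs through apply_check with a healthy verdict, which is the whole point of the rule.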

Action 2: snooze (the deferral)

Some fixes take longer than the alert wants to wait. The vendor whose tool runs the backup is slow to respond. The disk replacement is on order. The person who knows the script is out until Thursday. Sam isn’t able to fix it right now, but he’s on it — he just needs the sentinel to be quiet about this one job for a day.

Snooze opens a small modal asking for the number of days, with a 1-day default and a max of 7. On save, a snooze_until value is written to the job’s row in bk-state. The next check reads that in the “muted or snoozed?” step from Part 3 and keeps the job quiet until the snooze ends — but it keeps checking, so the moment the job recovers on its own, the state flips to green without anyone tapping anything. When the snooze ends, the checker re-evaluates — if the job is still broken, the next alert may be an escalation.
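To make the "quiet but still checking" behavior concrete, here is a small sketch under stated assumptions: the job's row is a plain dict, snooze_until is held as a timezone-aware datetime (a real table would more likely store a string or epoch value), and the function names are invented for the example. Escalation is not modeled.

```python
from datetime import datetime, timedelta, timezone

MAX_SNOOZE_DAYS = 7  # modal limit described above


def snooze(row, days=1, now=None):
    """Write snooze_until on the job's row; days defaults to 1, max 7."""
    now = now or datetime.now(timezone.utc)
    if not 1 <= days <= MAX_SNOOZE_DAYS:
        raise ValueError(f"days must be between 1 and {MAX_SNOOZE_DAYS}")
    row["snooze_until"] = now + timedelta(days=days)


def check_job(row, run_is_healthy, now):
    """Run the check regardless of snooze; return True if an alert should go out."""
    if run_is_healthy:
        # Recovery during a snooze flips the job to green with no click needed.
        row["status"] = "green"
        row.pop("snooze_until", None)
        return False
    row["status"] = "failing"
    snooze_until = row.get("snooze_until")
    snoozed = snooze_until is not None and now < snooze_until
    return not snoozed
```

The snooze only suppresses the alert. The health evaluation always runs, so a job that recovers on its own is noticed on the very next check.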

Snooze is bounded. The rules doc has a configurable max_snoozes_per_break setting (default three). After that many snoozes on the same break, further snooze attempts are rejected with a “You’ve hit the snooze cap on this job; please fix or escalate” reply, and the next check alerts normally regardless. This exists because the most dangerous failure mode is snoozing a dead backup week after week, right up until the day you need it.
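The cap itself is a simple counter check. The sketch below assumes a per-job snoozes_this_break counter that resets when the job recovers. Both the counter name and the reset point are illustrative guesses; only the max_snoozes_per_break setting and its default of three come from the description above.

```python
DEFAULT_MAX_SNOOZES_PER_BREAK = 3

CAP_MESSAGE = "You’ve hit the snooze cap on this job; please fix or escalate"


def try_snooze(row, rules):
    """Return (accepted, reply). Rejects once the cap for this break is used up."""
    cap = rules.get("max_snoozes_per_break", DEFAULT_MAX_SNOOZES_PER_BREAK)
    used = row.get("snoozes_this_break", 0)  # hypothetical counter field
    if used >= cap:
        return False, CAP_MESSAGE
    row["snoozes_this_break"] = used + 1
    return True, f"Snoozed ({used + 1} of {cap})"


def on_recovery(row):
    """A verified healthy run ends the break, so the counter starts over."""
    row["snoozes_this_break"] = 0
```

With the default setting, the first three calls for a broken job are accepted and the fourth is rejected with the cap reply.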

Action 3: mute (the “this one is supposed to be off”)

Sometimes a job is failing because it’s meant to be paused. The seasonal store you only back up in Q4. The old database you decommissioned last week but left on the list. The export you turned off on purpose while you migrate. The owner doesn’t want to fix it and doesn’t want to be reminded — it’s correct that it isn’t running.

Mute writes a muted: true flag to the job’s row in bk-state. The checker skips muted jobs entirely — no tests, no alerts — but the row stays on the list so it isn’t forgotten. A muted job shows up in the daily summary with a small “muted by Sam on 2026-06-10” note, so a paused backup never quietly becomes a forgotten backup. When the season comes around or the migration finishes, the owner un-mutes it and normal checks resume on the next run.

The difference between mute and snooze matters. Snooze says “this should be working, give me time.” Mute says “this is correctly not working, leave it alone.” Keeping them separate means a snoozed job always comes back to bite you if you forget it, while a muted job is an explicit, logged decision you can see in the summary every day.
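In code, the distinction shows up as two different branches in the “muted or snoozed?” step. The following is a hedged sketch: the return labels, the muted_by and muted_on fields, and the summary format are assumptions made for illustration, and the row is again a plain dict.

```python
from datetime import datetime, timedelta, timezone


def gate(row, now):
    """Decide what the checker does with a job on this run."""
    if row.get("muted"):
        return "skip"  # no tests, no alerts
    until = row.get("snooze_until")
    if until is not None and now < until:
        return "check-quietly"  # still tested, alert suppressed
    return "check-and-alert"


def summary_line(job_id, row):
    """One line per job for the daily summary; muted jobs stay visible."""
    if row.get("muted"):
        return f"{job_id}: muted by {row['muted_by']} on {row['muted_on']}"
    return f"{job_id}: {row.get('status', 'unknown')}"
```

A muted job never reaches the health tests, while a snoozed one does, which is why only the snoozed job can recover or come back on its own.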

Every action is logged, every action is reversible

The bk-audit table records every mark-fixed, snooze, and mute with the user who took the action, the timestamp, and a snapshot of the job’s state before and after. If a job gets muted by mistake (somebody thought it was decommissioned, it wasn’t), an admin can run an “undo last action” through a small admin command that reads the previous-state snapshot and restores it. The undo is itself an audit row, so the trail of changes stays complete.
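One way to picture the snapshot-and-undo mechanism is below. This is a sketch under assumptions, not the real admin command: lists and dicts stand in for bk-audit and bk-state, the row layout and function names are invented, and undo simply restores the "before" snapshot of the most recent audit row for that job.

```python
import copy

# In-memory stand-ins for the two tables (assumption).
bk_state = {}
bk_audit = []


def apply_action(job_id, action, user, at, changes):
    """Apply changes to a job's row and log before/after snapshots."""
    before = copy.deepcopy(bk_state.get(job_id, {}))
    after = {**before, **changes}
    bk_state[job_id] = after
    bk_audit.append({
        "job": job_id, "action": action, "user": user, "at": at,
        "before": before, "after": copy.deepcopy(after),
    })


def undo_last_action(job_id, user, at):
    """Restore the previous-state snapshot; the undo is logged as its own row."""
    last = next((r for r in reversed(bk_audit) if r["job"] == job_id), None)
    if last is None:
        raise LookupError(f"no audit rows for {job_id}")
    current = copy.deepcopy(bk_state.get(job_id, {}))
    restored = copy.deepcopy(last["before"])
    bk_state[job_id] = restored
    bk_audit.append({
        "job": job_id, "action": "undo", "user": user, "at": at,
        "before": current, "after": copy.deepcopy(restored),
    })
```

Because the undo appends rather than deletes, the mistaken mute and its reversal both stay visible in the trail.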

This kind of reversibility matters most for the decisions you’ll only think about once. The next time that decommissioned database turns out to have been needed after all, the audit trail is the only record of who muted it and when. A backup monitor that lets things disappear silently is just a quieter version of the problem it’s meant to solve.

Next post: the cost breakdown. The whole pipeline above runs in coffee-money territory at SMB volume; Part 6 explains exactly where the dollars go and why it’s one of the cheapest systems in the series.
