How a backup alert reaches its owner
The checker picked a state — warn, alert, or escalate. Now the dispatch Lambda has to figure out who to send it to, on what channel, at what time of day, and with what context attached. Get any of those wrong and the alert is worse than no alert: a 2am Slack ping, a vague “a backup failed,” a notification to somebody who left the company three months ago. Four small guardrails sit between the state and the actual ping.
Key takeaways
- Owner resolution: per-job override beats per-group default beats fallback to the configured admin.
- Slack DMs are the default; email is the fallback if no Slack ID is configured.
- Quiet hours and holiday calendars defer pings to the next available business hour.
- Every alert names the job and includes exactly what failed, how late or how small, a link to the evidence, and action buttons.
- Escalation pings the named target alongside the owner; the owner stays in the loop.
Four guardrails on every dispatch
Gate 1: resolve the owner
The dispatch Lambda looks for a job’s owner in three places, in order. First, the job list’s per-job owner_email column: if a row has a specific person assigned, that person owns it regardless of any group default. Second, the per-group default in the rules doc (“all database backups default to the ops lead”). Third, the configured admin fallback: the person who set up the sentinel and gets every unowned alert. The fallback should never fire in steady state; if it does, the daily summary names every job that hit the fallback so the rules doc can be fixed.
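The three-step lookup can be sketched as a small function. This is a minimal illustration, not the sentinel's actual code: only owner_email comes from the job list described above, while the "group" key, the group_defaults mapping, and the returned source label are assumptions made for the example.

```python
def resolve_owner(job: dict, group_defaults: dict, admin_email: str) -> tuple:
    """Return (owner_email, source) using the per-job > per-group > admin order."""
    # 1. Per-job override from the job list's owner_email column.
    per_job = (job.get("owner_email") or "").strip()
    if per_job:
        return per_job, "job"

    # 2. Per-group default from the rules doc (hypothetical "group" key).
    group_owner = (group_defaults.get(job.get("group")) or "").strip()
    if group_owner:
        return group_owner, "group"

    # 3. Admin fallback. The source label lets a daily summary list these jobs.
    return admin_email, "fallback"
```

Returning the source alongside the address is one way to let the daily summary report every job that fell through to the admin.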
Once the dispatch knows which person to ping, it looks up their delivery preference. The voice doc maps each owner to a Slack member ID if one is set, otherwise to an email address. Slack is preferred because alerts feel like work-context messages, and a Slack DM with action buttons is more useful than an email link. Email is the fallback so nobody falls through the cracks.
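Channel selection is a two-way choice, sketched below. The slack_ids mapping (owner email to Slack member ID) stands in for whatever the voice doc actually stores; its shape and the returned dictionary are assumptions.

```python
def pick_channel(owner_email: str, slack_ids: dict) -> dict:
    """Prefer a Slack DM when a member ID is configured, otherwise use email."""
    slack_id = (slack_ids.get(owner_email) or "").strip()
    if slack_id:
        return {"channel": "slack", "address": slack_id}
    # No Slack ID on file: fall back to email so the alert still lands somewhere.
    return {"channel": "email", "address": owner_email}
```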
Gate 2: quiet hours
A backup usually runs overnight, so a failure is often detected in the small hours, exactly when you don’t want a Slack ping waking somebody up for a job that can wait until the morning. Gate 2 reads the rules doc’s quiet-hours setting (default 6pm to 8am, configurable per business). If the current local time is in the quiet window, the dispatch creates a one-off EventBridge Scheduler schedule that fires when the quiet window ends, then exits without sending. The Scheduler invokes the same dispatch Lambda with the same payload at the deferred time, when Gate 2 lets it through.
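A rough sketch of the quiet-hours check and the deferred schedule follows. It assumes the caller has already converted "now" to the business's local time, and it only handles the daily window (weekends and holidays are out of scope here). The function names, the schedule name, and the ARNs are placeholders; the dictionary mirrors the parameters of boto3's EventBridge Scheduler create_schedule call, which the sketch builds but does not invoke.

```python
import json
from datetime import datetime, time, timedelta

QUIET_START = time(18, 0)  # default 6pm
QUIET_END = time(8, 0)     # default 8am

def in_quiet_hours(now_local: datetime, start=QUIET_START, end=QUIET_END) -> bool:
    t = now_local.time()
    if start <= end:
        return start <= t < end
    # Window wraps past midnight (e.g. 18:00 to 08:00).
    return t >= start or t < end

def next_send_time(now_local: datetime, start=QUIET_START, end=QUIET_END) -> datetime:
    """Return now if sending is allowed, else the next end of the quiet window."""
    if not in_quiet_hours(now_local, start, end):
        return now_local
    candidate = now_local.replace(hour=end.hour, minute=end.minute,
                                  second=0, microsecond=0)
    if candidate <= now_local:
        candidate += timedelta(days=1)
    return candidate

def build_schedule_request(name, when_local, tz_name, lambda_arn, role_arn, payload):
    """Keyword arguments for a one-off schedule that re-invokes the dispatch Lambda."""
    return {
        "Name": name,
        "ScheduleExpression": f"at({when_local.strftime('%Y-%m-%dT%H:%M:%S')})",
        "ScheduleExpressionTimezone": tz_name,
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {"Arn": lambda_arn, "RoleArn": role_arn,
                   "Input": json.dumps(payload)},
        "ActionAfterCompletion": "DELETE",  # remove the schedule once it has fired
    }

# Usage (not executed here):
#   boto3.client("scheduler").create_schedule(**build_schedule_request(...))
```

Passing the same payload back as the target input is what lets the deferred invocation run through the gates again unchanged.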
There’s one exception, and it’s configurable: an escalate state can be allowed to override quiet hours for jobs marked critical in the rules doc. A nightly dump of your billing database going dark for two days is the kind of thing some teams do want a 3am ping for. By default, even escalations wait for business hours — the override is opt-in per job.
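The opt-in override reduces to a single predicate. In this sketch the "critical" flag is a hypothetical field name for however the rules doc marks a job as critical.

```python
def bypasses_quiet_hours(state: str, job_rules: dict) -> bool:
    """Only an escalate state on a job explicitly marked critical skips the wait."""
    # Default is False: even escalations wait unless the job opts in.
    return state == "escalate" and bool(job_rules.get("critical", False))
```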
Gate 3: holiday calendar
The rules doc lists the holidays you observe — either a static list (“Christmas Day, New Year’s Day, Independence Day...”) or a reference to a Google Calendar that holds them. Gate 3 checks the current local date against that list and, if it’s a configured holiday, defers the dispatch to the next non-holiday business day.
The explicit list is deliberate: the sentinel won’t auto-detect a country’s public holidays for you. The two ways the list can be wrong fail very differently. A holiday you forgot to add fires a ping that lands on a closed laptop. A holiday in the list that’s no longer observed just delays a ping by one business day, which is fine. The trade-off favors keeping the list explicit. (As with quiet hours, a critical-job escalation can be set to ignore the holiday calendar too.)
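The deferral itself can be as simple as walking forward one day at a time. This sketch assumes a static holiday list already parsed into dates, and it treats Saturday and Sunday as non-business days, which is an assumption rather than something the rules doc is known to specify.

```python
from datetime import date, timedelta

def next_working_day(today: date, holidays: set) -> date:
    """Return today if it is a working day, else the next non-holiday weekday."""
    day = today
    # weekday() is 5 for Saturday and 6 for Sunday.
    while day.weekday() >= 5 or day in holidays:
        day += timedelta(days=1)
    return day
```

If the returned date differs from today, the dispatch would defer to that date in the same way Gate 2 defers past quiet hours.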
Gate 4: compose with full context, then ship
The voice doc has one Slack message template per kind of problem — missing, stale, too small, shrank — each with placeholders for the job name, what failed, how late or how small, the last good run, and a link to the evidence. The dispatch Lambda fills the placeholders, attaches Mark fixed, Snooze, and Mute buttons, and ships the message via the Slack API. The bot token itself lives in Secrets Manager.
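A sketch of the compose step is below. The template wording, the placeholder names, and the action_id values are invented for illustration; the real templates live in the voice doc. The block structure follows Slack's Block Kit (a section block plus an actions block of buttons), and the result could be passed to chat.postMessage along with the recipient's channel.

```python
# Hypothetical templates; the voice doc holds one per kind of problem.
TEMPLATES = {
    "stale": ("*{job}* looks stale: {what_failed}. It is {how_bad}. "
              "Last good run: {last_good}. Evidence: {evidence_url}"),
    "too_small": ("*{job}* looks too small: {what_failed}. It is {how_bad}. "
                  "Last good run: {last_good}. Evidence: {evidence_url}"),
}

BUTTONS = (("Mark fixed", "mark_fixed"), ("Snooze", "snooze"), ("Mute", "mute"))

def build_slack_message(kind: str, fields: dict, job_id: str) -> dict:
    """Fill the template for this problem kind and attach the three action buttons."""
    text = TEMPLATES[kind].format(**fields)
    buttons = [
        {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "action_id": action_id,
            "value": job_id,  # tells the button handler which job was acted on
        }
        for label, action_id in BUTTONS
    ]
    return {
        "text": text,  # plain-text fallback shown in notifications
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": text}},
            {"type": "actions", "elements": buttons},
        ],
    }
```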
For email fallback, the same template is wrapped in a small HTML email with the same fields and links that, when clicked, hit a Function URL that records the action — the email equivalent of the Slack buttons.
An escalate state adds a second recipient: the escalation target named in the rules doc for that job. The owner is still pinged (the escalation isn’t a substitute for the owner’s ping — both go out), but the manager now sees it too. The escalate template is slightly different: it includes how long the job has been broken and the previous alert dates, so the manager has the trail at hand.
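The recipient rule is short enough to state in code. Skipping the second ping when the escalation target is the same person as the owner is an assumption of this sketch, not something the rules doc is known to specify.

```python
def recipients_for(state: str, owner: str, escalation_target=None) -> list:
    """The owner is always pinged; an escalate state adds the named target."""
    recipients = [owner]
    if state == "escalate" and escalation_target and escalation_target != owner:
        recipients.append(escalation_target)
    return recipients
```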
Every dispatch — Slack or email, owner or escalate — updates the job’s row in bk-state in DynamoDB. The next check reads that row and knows the owner has already been told, so it won’t re-ping the same unchanged failure.
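One way to picture the de-duplication is as a comparison between the stored row and the current finding. The attribute names here (last_notified_state, last_notified_problem, last_notified_at) are hypothetical; only the bk-state table itself comes from the text above, and the sketch works on plain dictionaries rather than calling DynamoDB.

```python
def should_notify(row, state: str, problem: str) -> bool:
    """Ping only if nothing was sent yet or the failure changed since the last ping."""
    if not row:
        return True
    previous = (row.get("last_notified_state"), row.get("last_notified_problem"))
    return previous != (state, problem)

def record_dispatch(row, state: str, problem: str, sent_at: str) -> dict:
    """Return the updated row to write back to bk-state after a successful send."""
    updated = dict(row or {})
    updated.update({
        "last_notified_state": state,
        "last_notified_problem": problem,
        "last_notified_at": sent_at,
    })
    return updated
```

A move from alert to escalate counts as a change in this sketch, so the escalation still goes out even though the underlying problem is the same.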
Why the guardrails exist
None of these gates are exotic. They’re the kind of small care a thoughtful human would take if they were sending the alerts themselves — check who actually owns this, don’t ping at 3am for something that can wait, skip the day everyone’s off, include enough context that the recipient knows exactly what broke without opening a single folder. Putting them in code as four small sequential gates makes them part of the design, not a feature you’re trusting the writer of any one alert to remember.
Next post: how a backup status gets cleared once the owner has acted — what mark-fixed, snooze, and mute each do to the state, and how the audit trail stays clean.