How a backup job gets registered
The sentinel only watches what’s on the job list. So the first task is making sure the list actually reflects every backup your business runs. There are three ways a job gets on it: somebody types a row in the Drive sheet, somebody forwards the report their backup tool emails out, or a backup script pings a private web address when it finishes. The first one is obvious. The other two exist because in real life nobody types a row in a sheet for the cron job they wrote two years ago and forgot.
Key takeaways
- Three intake lanes feed one job list: the Drive sheet, an inbox-forwarding lane, and a heartbeat lane.
- Forwarded reports are read by a small parser; Bedrock Haiku 4.5 turns the email into a proposed row.
- Every proposed row goes to the team’s Slack for one-tap approval before it lands on the list.
- A backup script can register itself by pinging a private Function URL when it finishes.
- The Drive sheet stays the canonical store. The other lanes are conveniences that write into it.
Three lanes into one job list
Lane 1: the Drive sheet itself
The simplest lane. Open the job-list sheet in Drive, add a row, save. The columns are short: name, what it backs up, owner email, where the backup lands, how often it should run, the smallest size you’d expect, and how late is too late. A small Lambda, drive-sync, runs every fifteen minutes, exports the sheet as plain CSV via the Drive API, and writes it to s3://bk-registry-source/jobs.csv if the sheet has changed since the last sync. The checker reads from S3, not Drive directly. That keeps Drive API calls predictable and, with versioning turned on for the bucket, means a bad bulk edit can be rolled back by restoring the previous version of the file.
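The "only write if the sheet changed" step could look something like the sketch below. It assumes drive-sync remembers a SHA-256 digest of the last CSV it uploaded and compares each fresh export against it. The column headers, function names, and the digest approach are illustrative assumptions rather than the actual implementation, and the Drive export and S3 upload calls are left out.

```python
import csv
import hashlib
import io
from typing import Optional

# Hypothetical headers for the seven columns described above.
COLUMNS = ["name", "backs_up", "owner_email", "destination",
           "frequency_hours", "min_size_mb", "max_late_hours"]


def csv_digest(csv_bytes: bytes) -> str:
    """Fingerprint the exported sheet so unchanged exports can be skipped."""
    return hashlib.sha256(csv_bytes).hexdigest()


def needs_upload(csv_bytes: bytes, last_digest: Optional[str]) -> bool:
    """True when the export differs from the copy already written to S3."""
    return csv_digest(csv_bytes) != last_digest


def parse_jobs(csv_bytes: bytes) -> list:
    """Parse the CSV into one dict per job; reject a sheet with missing columns."""
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"sheet is missing columns: {sorted(missing)}")
    return list(reader)
```

Rejecting a sheet with missing columns before the upload is one way to keep a broken edit from ever reaching the checker.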
This lane covers the cases where you already know a job exists, you know where it lands, and you can spend thirty seconds typing it in. Most existing jobs go in this way during the initial setup.
Lane 2: inbox forwarding (the lane most teams actually use)
Most backup tools already email a report after each run — “Backup completed, 412 MB” or “Backup FAILED.” Set up a dedicated inbound address — something like backups@your-company.com — via Amazon SES, and forward those reports to it (or have the tool send them there directly). SES writes the raw message to s3://bk-raw-mime/. The S3 write triggers a parser Lambda. The Lambda reads the email body and any attached log.
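As a rough sketch of the parser's first step, Python's standard email package can pull the body and any text attachments out of the raw message SES stored. The function name and the shape of the returned dict are assumptions for illustration. Fetching the object from S3 is omitted.

```python
from email import message_from_bytes, policy


def extract_report_text(raw_mime: bytes) -> dict:
    """Pull the subject, readable body, and text attachments from a raw email."""
    msg = message_from_bytes(raw_mime, policy=policy.default)

    # Prefer the plain-text body; fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""

    # Keep only text attachments (for example a .log file); skip binaries.
    attachments = {}
    for part in msg.iter_attachments():
        if part.get_content_maintype() == "text":
            attachments[part.get_filename() or "unnamed"] = part.get_content()

    return {
        "subject": str(msg["subject"] or ""),
        "body": body,
        "attachments": attachments,
    }
```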
Then a Bedrock Haiku 4.5 call reads the text and proposes a structured row: job name, what it backs up, where the backup lands, the size it reported, and how often this report seems to arrive. The model prompt is short: “Read this backup report. Propose a job row for the watch list. Return JSON only. Mark each field with a confidence score. Don’t invent a path or size that isn’t in the text.” The output goes to a Slack interactive message that pings the team: the proposed row, the confidence per field, and three buttons — approve, edit, discard. On approve, a Lambda writes the row to the Drive sheet via the Sheets API. On edit, the rep gets a fillable form pre-filled with the proposal. On discard, the message is logged and the email moved to a discarded prefix in S3 for audit.
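To make the "JSON only, confidence per field" contract concrete, here is a minimal sketch of checking the model's reply and turning it into a Slack Block Kit message with the three buttons. The JSON shape (a value and a confidence between 0 and 1 for each field), the field names, and the action IDs are assumptions for illustration. Only the block structure follows Slack's documented format.

```python
import json

# Hypothetical field names for the proposed row.
REQUIRED_FIELDS = ("name", "backs_up", "destination", "reported_size", "frequency")


def parse_proposal(model_text: str) -> dict:
    """Parse the model's reply; every field needs a value and a confidence."""
    proposal = json.loads(model_text)  # raises ValueError if the reply is not pure JSON
    if not isinstance(proposal, dict):
        raise ValueError("proposal must be a JSON object")
    for field in REQUIRED_FIELDS:
        entry = proposal.get(field)
        if not isinstance(entry, dict) or "value" not in entry:
            raise ValueError(f"missing field: {field}")
        confidence = entry.get("confidence")
        if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
            raise ValueError(f"bad confidence for field: {field}")
    return proposal


def approval_blocks(proposal: dict, proposal_id: str) -> list:
    """Build Slack blocks: the proposed row, then approve/edit/discard buttons."""
    lines = [
        f"*{f}*: {proposal[f]['value']} (confidence {proposal[f]['confidence']:.2f})"
        for f in REQUIRED_FIELDS
    ]

    def button(label, action_id, style=None):
        element = {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "action_id": action_id,
            "value": proposal_id,  # lets the click handler find the stored proposal
        }
        if style:
            element["style"] = style
        return element

    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": "\n".join(lines)}},
        {"type": "actions", "elements": [
            button("Approve", "approve", "primary"),
            button("Edit", "edit"),
            button("Discard", "discard", "danger"),
        ]},
    ]
```

A reply that wraps the JSON in chatty text fails at json.loads, which is the cheap way to enforce "Return JSON only" before anything reaches Slack.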
The reason every proposed row goes to a human first is simple: a job the model misread — wrong path, wrong size threshold — is worse than a job that never made it onto the list. The misread one will quietly tell you everything is fine while watching the wrong folder.
Lane 3: heartbeat
Some backups are scripts you wrote, not tools you bought. A cron job that runs pg_dump, an rsync to a NAS, a nightly export of a cloud drive. For those, the cleanest signal is the job telling the sentinel directly that it finished. Add one line at the end of the script (a single web call to a private address, passing a short job key) and the script now “checks in” every time it completes.
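For a script written in Python, the check-in could be a small helper like the sketch below, called as the last step. The JSON payload shape, the field names job_key and size_bytes, and the function names are assumptions. The URL is whatever private address your heartbeat lane exposes.

```python
import json
import urllib.request


def build_heartbeat_request(function_url: str, job_key: str,
                            size_bytes: int) -> urllib.request.Request:
    """Build the POST that tells the sentinel this job finished."""
    payload = json.dumps({"job_key": job_key, "size_bytes": size_bytes}).encode("utf-8")
    return urllib.request.Request(
        function_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_heartbeat(function_url: str, job_key: str, size_bytes: int,
                   timeout: float = 10.0) -> bool:
    """Best-effort check-in: a failed ping must never fail the backup itself."""
    request = build_heartbeat_request(function_url, job_key, size_bytes)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False
```

Swallowing network errors here is a deliberate choice in this sketch: if the ping fails, the sentinel simply sees silence and alerts, which is the behavior this lane relies on anyway.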
That web call lands on a heartbeat Lambda behind a Function URL (a plain web address that runs a Lambda, with no API Gateway in front of it — cheaper and simpler). The first time a new job key checks in, the sentinel doesn’t know it yet, so it proposes a row in the same Slack flow as Lane 2 — one-tap approve to add it. After that, each heartbeat is recorded as evidence that the job finished, with its timestamp and the size the script reported. The power of this lane is the absence of a heartbeat: if a job that usually checks in every night goes quiet, the next check sees no recent heartbeat and flips the job to alert. The job doesn’t have to report its own failure — silence is the failure.
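The two halves of this lane, recording a check-in and noticing its absence, can be sketched as follows. The event fields body and isBase64Encoded are part of the payload a Lambda Function URL delivers. The job_key and size_bytes names, and the rule of "expected interval plus allowed lateness", are assumptions about how the sentinel might do it.

```python
import base64
import json
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_heartbeat(event: dict) -> dict:
    """Extract the job key and reported size from a Function URL event."""
    body = event.get("body") or "{}"
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("heartbeat body must be a JSON object")
    job_key = data.get("job_key")
    if not isinstance(job_key, str) or not job_key:
        raise ValueError("job_key is required")
    return {
        "job_key": job_key,
        "size_bytes": data.get("size_bytes"),
        "received_at": datetime.now(timezone.utc),
    }


def is_silent(last_heartbeat: Optional[datetime], frequency_hours: float,
              max_late_hours: float, now: datetime) -> bool:
    """True when no heartbeat arrived within the expected interval plus grace."""
    if last_heartbeat is None:
        return True  # a listed job with no heartbeat on record counts as silent
    allowed_gap = timedelta(hours=frequency_hours + max_late_hours)
    return now - last_heartbeat > allowed_gap
```

With a nightly job (24 hours) and six hours of allowed lateness, a heartbeat from 25 hours ago is still fine, while one from 31 hours ago flips the job to alert.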
Why the job list stays the source of truth
Three lanes in, but only one place where the checker actually looks. That’s a deliberate constraint. If the two convenience lanes also wrote directly to the checker’s state, every “why did this alert fire?” question would mean checking three places: the sheet and each lane’s own writes. Funneling everything through the Drive sheet means there is exactly one row per job, and any rep can read or edit any of it without learning a new tool. The convenience lanes are first-class for getting jobs in, but they always pass through the sheet on the way.
Next post: how the checker actually reads each job’s evidence, runs its three tests, and picks one of four states.