Part 2 of 7 · Backup sentinel series ~4 min read

How a backup job gets registered

The sentinel only watches what’s on the job list. So the first task is making sure the list actually reflects every backup your business runs. There are three ways a job gets on it: somebody types a row in the Drive sheet, somebody forwards the report their backup tool emails out, or a backup script pings a private web address when it finishes. The first one is obvious. The other two exist because in real life nobody types a row in a sheet for the cron job they wrote two years ago and forgot about.

Key takeaways

  • Three intake lanes feed one job list: the Drive sheet, an inbox-forwarding lane, and a heartbeat lane.
  • Forwarded reports are read by a small parser; Bedrock Haiku 4.5 turns the email into a proposed row.
  • Every proposed row goes to the team’s Slack for one-tap approval before it lands on the list.
  • A backup script can register itself by pinging a private Function URL when it finishes.
  • The Drive sheet stays the canonical store. The other lanes are conveniences that write into it.

Three lanes into one job list

[Figure: three intake lanes (Lane 1, manual Drive sheet; Lane 2, inbox forwarding via SES and Haiku 4.5; Lane 3, self-registering heartbeat) converge on the Drive job list, which drive-sync mirrors to S3 every 15 minutes for the checker to read on schedule.]
Fig 2. Three lanes converge on one Drive sheet. The sheet is the source of truth; the inbox lane and the heartbeat lane are conveniences that propose rows for human approval. The drive-sync Lambda mirrors the sheet to S3 so the checker can read it without hitting Drive on every run.

Lane 1: the Drive sheet itself

The simplest lane. Open the job-list sheet in Drive, add a row, save. The columns are short: name, what it backs up, owner email, where the backup lands, how often it should run, the smallest size you’d expect, and how late is too late. A small Lambda, drive-sync, runs every fifteen minutes, exports the sheet as plain CSV via the Drive API, and writes it to s3://bk-registry-source/jobs.csv if the sheet has changed since the last sync. The checker reads from S3, not Drive directly. That keeps Drive API calls predictable and, with versioning enabled on the bucket, keeps every earlier copy of the list, so a bad bulk edit can be rolled back by restoring the previous version.
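The "write only if changed" step and the column layout can be sketched in Python. This is a minimal sketch under assumptions: the post does not say how drive-sync detects a change, so a SHA-256 comparison stands in here, and the header names in COLUMNS and the write callback are hypothetical.

```python
import csv
import hashlib
import io
from typing import Callable, Optional

# Hypothetical header names for the seven columns described above.
COLUMNS = ["name", "backs_up", "owner_email", "lands_at",
           "frequency", "min_size_mb", "max_lateness_hours"]


def sync_if_changed(csv_bytes: bytes, last_hash: Optional[str],
                    write: Callable[[bytes], None]) -> str:
    """Call `write` only when the export differs from the last sync.

    Returns the hash of the current export so the caller can store it.
    """
    new_hash = hashlib.sha256(csv_bytes).hexdigest()
    if new_hash != last_hash:
        # In a Lambda, `write` could wrap boto3's
        # s3.put_object(Bucket="bk-registry-source", Key="jobs.csv", Body=csv_bytes).
        write(csv_bytes)
    return new_hash


def parse_jobs(csv_bytes: bytes) -> list:
    """Parse the mirrored CSV into one dict per job, skipping unnamed rows."""
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8"), newline=""))
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("sheet is missing columns: " + ", ".join(missing))
    return [row for row in reader if (row.get("name") or "").strip()]
```

Where the previous hash lives between runs is left open; S3 object metadata or a small parameter store entry would both work.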

This lane covers the cases where you already know a job exists, you know where it lands, and you can spend thirty seconds typing it in. Most existing jobs go in this way during the initial setup.

Lane 2: inbox forwarding (the lane most teams actually use)

Most backup tools already email a report after each run — “Backup completed, 412 MB” or “Backup FAILED.” Set up a dedicated inbound address — something like backups@your-company.com — via Amazon SES, and forward those reports to it (or have the tool send them there directly). SES writes the raw message to s3://bk-raw-mime/. The S3 write triggers a parser Lambda. The Lambda reads the email body and any attached log.
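Reading "the email body and any attached log" from the raw message SES stores can be done with Python's standard email package. The sketch below is illustrative only: the function name and the shape of the returned dict are assumptions, and fetching the object from S3 is left out.

```python
from email import message_from_bytes, policy


def extract_report_text(raw_mime: bytes) -> dict:
    """Pull the subject, body text, and text attachments out of a raw email."""
    msg = message_from_bytes(raw_mime, policy=policy.default)

    # Prefer the plain-text body; fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""

    logs = []
    for part in msg.iter_attachments():
        # Only text attachments (for example a .log file) are useful here.
        if part.get_content_maintype() == "text":
            logs.append({
                "filename": part.get_filename() or "",
                "text": part.get_content(),
            })

    return {
        "subject": str(msg["subject"] or ""),
        "body": body,
        "attachments": logs,
    }
```

The returned text is what would be passed to the model in the next step.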

Then a Bedrock Haiku 4.5 call reads the text and proposes a structured row: job name, what it backs up, where the backup lands, the size it reported, and how often this report seems to arrive. The model prompt is short: “Read this backup report. Propose a job row for the watch list. Return JSON only. Mark each field with a confidence score. Don’t invent a path or size that isn’t in the text.” The output goes to a Slack interactive message that pings the team: the proposed row, the confidence per field, and three buttons: approve, edit, discard. On approve, a Lambda writes the row to the Drive sheet via the Sheets API. On edit, the reviewer gets a fillable form pre-filled with the proposal. On discard, the message is logged and the email is moved to a discarded prefix in S3 for audit.
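Between the model call and the Slack message, it helps to check that the JSON really has every field before anyone is asked to approve it. The sketch below assumes a response shape the post does not specify (each field as an object with value and confidence); the field names and action_id values are hypothetical. The block layout follows Slack's Block Kit section and actions blocks.

```python
import json

# Hypothetical field names for the proposed row.
REQUIRED_FIELDS = ("name", "backs_up", "lands_at", "reported_size", "frequency")


def parse_proposal(model_text: str) -> dict:
    """Validate the model's JSON; raise ValueError if a field is missing."""
    data = json.loads(model_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    proposal = {}
    for field in REQUIRED_FIELDS:
        entry = data.get(field)
        if not isinstance(entry, dict) or "value" not in entry:
            raise ValueError("missing field: " + field)
        confidence = float(entry.get("confidence", 0.0))
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence out of range for: " + field)
        proposal[field] = {"value": entry["value"], "confidence": confidence}
    return proposal


def _button(label: str, action_id: str, value: str, style: str = "") -> dict:
    button = {
        "type": "button",
        "text": {"type": "plain_text", "text": label},
        "action_id": action_id,
        "value": value,
    }
    if style:
        button["style"] = style  # Block Kit accepts "primary" or "danger"
    return button


def slack_blocks(proposal: dict, proposal_id: str) -> list:
    """Build the message: one line per field, then the three buttons."""
    lines = [
        "*{}*: {} (confidence {:.2f})".format(f, p["value"], p["confidence"])
        for f, p in proposal.items()
    ]
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": "\n".join(lines)}},
        {"type": "actions", "elements": [
            _button("Approve", "approve_job", proposal_id, "primary"),
            _button("Edit", "edit_job", proposal_id),
            _button("Discard", "discard_job", proposal_id, "danger"),
        ]},
    ]
```

Rejecting a malformed response before it reaches Slack keeps the approval step about judgment (is this the right path and size?) rather than about spotting missing fields.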

The reason every proposed row goes to a human first is simple: a job the model misread — wrong path, wrong size threshold — is worse than a job that never made it onto the list. The misread one will quietly tell you everything is fine while watching the wrong folder.

Lane 3: heartbeat

Some backups are scripts you wrote, not tools you bought. A cron job that runs pg_dump, an rsync to a NAS, a nightly export of a cloud drive. For those, the cleanest signal is the job telling the sentinel directly that it finished. Add one line at the end of the script (a single web call to a private address, passing a short job key) and the script now “checks in” every time it completes.
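For a script written in Python, the check-in could look like the sketch below. The JSON payload shape (job_key and size_bytes) and the function names are assumptions for illustration; the post only says the call passes a short job key, and the URL is whatever private address your deployment provides.

```python
import json
import urllib.request


def build_heartbeat(url: str, job_key: str, size_bytes: int) -> urllib.request.Request:
    """Build the POST request that tells the sentinel the job finished."""
    payload = json.dumps({"job_key": job_key, "size_bytes": size_bytes}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_heartbeat(url: str, job_key: str, size_bytes: int, timeout: float = 10.0) -> bool:
    """Send the heartbeat; return False instead of raising on network errors."""
    try:
        request = build_heartbeat(url, job_key, size_bytes)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError are OSError subclasses. A failed check-in
        # should not crash the backup script; the sentinel will notice
        # the missing heartbeat on its own.
        return False
```

Call send_heartbeat as the last step, after the backup itself has succeeded, so that a check-in always means a finished job.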

That web call lands on a heartbeat Lambda behind a Function URL (a plain web address that runs a Lambda, with no API Gateway in front of it — cheaper and simpler). The first time a new job key checks in, the sentinel doesn’t know it yet, so it proposes a row in the same Slack flow as Lane 2 — one-tap approve to add it. After that, each heartbeat is recorded as evidence that the job finished, with its timestamp and the size the script reported. The power of this lane is the absence of a heartbeat: if a job that usually checks in every night goes quiet, the next check sees no recent heartbeat and flips the job to alert. The job doesn’t have to report its own failure — silence is the failure.
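The two behaviors described above, proposing unknown keys and treating silence as failure, can be sketched as follows. This is an illustration under assumptions: the in-memory known_jobs and heartbeats containers, the propose callback, and the payload fields are hypothetical stand-ins for whatever storage the real Lambda uses. The event fields body and isBase64Encoded follow the Lambda Function URL request format.

```python
import base64
import json
from datetime import datetime, timedelta
from typing import Callable, Optional


def handle_heartbeat(event: dict, known_jobs: set, heartbeats: dict,
                     propose: Callable[[str], None], now: datetime) -> dict:
    """Record a heartbeat for a known job, or propose a row for a new key."""
    raw = event.get("body") or "{}"
    if event.get("isBase64Encoded"):
        raw = base64.b64decode(raw).decode("utf-8")
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": "invalid JSON"}
    job_key = body.get("job_key") if isinstance(body, dict) else None
    if not job_key:
        return {"statusCode": 400, "body": "missing job_key"}

    if job_key not in known_jobs:
        # First contact: ask a human to approve the row, record nothing yet.
        propose(job_key)
        return {"statusCode": 202, "body": "proposed"}

    heartbeats.setdefault(job_key, []).append(
        {"at": now, "size_bytes": body.get("size_bytes")}
    )
    return {"statusCode": 200, "body": "recorded"}


def is_silent(last_seen: Optional[datetime], expected_every: timedelta,
              grace: timedelta, now: datetime) -> bool:
    """True when no heartbeat arrived within the expected interval plus grace."""
    if last_seen is None:
        return True
    return now - last_seen > expected_every + grace
```

Here expected_every and grace correspond to the sheet's "how often" and "how late is too late" columns; a nightly job with a two-hour grace would count as silent 26 hours after its last check-in.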

Why the job list stays the source of truth

Three lanes in, but only one place where the checker actually looks. That’s a deliberate constraint. If all three lanes wrote directly to the checker’s state, every “why did this alert fire?” question would mean checking three places. Funneling everything through the Drive sheet means there is exactly one row per job, and anyone on the team can read or edit any of it without learning a new tool. The convenience lanes are first-class for getting jobs in, but they always pass through the sheet on the way.

Next post: how the checker actually reads each job’s evidence, runs its three tests, and picks one of four states.
