Part 2 of 7 · Daily briefing bot series ~4 min read

How the ingestor walks your sources

Once a morning, a small dispatcher reads your source list, hands each source to a specialist worker, and drops everything new into one shared place. Three workers, one mailbox — the boring part of the system.

[Diagram] A morning timer and an editable source list both feed a central dispatcher, which routes each source by type to one of three parallel workers: an RSS worker (headlines and summaries), an API worker (JSON results from public APIs), and a page worker (scrape and extract article text). All three fan back in to a single "Today's items" box, deduped and ready to score. Workers run in parallel; one slow source doesn't hold up the others.
Fig 2. The ingestor: fan-out across three worker types, fan-in to one shared mailbox.

Why three workers, not one

Not every source speaks the same language.

  • Feeds — many sites publish a tidy stream of headlines and summaries the bot can read directly. The easy case.
  • Public services — some sites give you the data through a public service in whatever shape they chose years ago. Needs a small adapter per service.
  • Plain web pages — everything else. The bot pulls the article body out of the page and drops the navigation and the cookie banner.

You could write one worker that does all three. But then a single bad change breaks every source at once. Three small workers — one per kind — keep the failure contained. If the page worker chokes on a redesigned site, the other two keep flowing.
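The fan-out/fan-in shape above can be sketched in a few lines. This is a minimal illustration, not the bot's actual code: the worker functions, the `"type"` field, and the stub return values are all hypothetical stand-ins for real fetchers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical workers -- one small function per kind of source.
def fetch_rss(source):
    return [f"rss item from {source['url']}"]

def fetch_api(source):
    return [f"api item from {source['url']}"]

def fetch_page(source):
    return [f"page item from {source['url']}"]

# The dispatcher routes by source type.
WORKERS = {"rss": fetch_rss, "api": fetch_api, "page": fetch_page}

def run_ingest(sources):
    """Fan out: each source goes to the worker for its type, in parallel.
    Fan in: everything lands in one shared list -- today's items."""
    items = []
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(WORKERS[s["type"]], s) for s in sources]
        for future in futures:
            items.extend(future.result())
    return items
```

Because each worker is its own small function, a bad change to the page worker can't touch the RSS or API paths.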

What “new” actually means

Each fetched item gets a short fingerprint — the URL plus the publish date. Before the worker does any further work, it asks: “have I seen this fingerprint before?” If yes, skip. If no, save the fingerprint and pass the item along.

This is how the bot reads the same feed every morning without flooding you with duplicates. It also means if a source republishes an old article under the same address, the bot ignores it — same fingerprint.
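The fingerprint check is simple enough to sketch. Assume each item is a dict with `url` and `published` fields; the hashing and the in-memory `seen` set are illustrative (the real bot would persist seen fingerprints between runs).

```python
import hashlib

seen = set()  # in the real bot this would be saved to disk between runs

def fingerprint(item):
    """URL plus publish date, hashed to a short stable string."""
    raw = f"{item['url']}|{item['published']}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

def is_new(item):
    """Skip anything we've fingerprinted before; remember the rest."""
    fp = fingerprint(item)
    if fp in seen:
        return False
    seen.add(fp)
    return True
```

A republished article with the same URL and date hashes to the same fingerprint, so it's skipped automatically.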

Three things that go wrong (and what happens then)

  • A source is offline. The worker waits a few seconds, tries twice more, then gives up and logs it. The other workers keep going. Tomorrow it’ll try again from scratch.
  • A page changes its layout. The page worker can’t find the article body. Item is logged as “couldn’t parse” and skipped. The digest still ships — you fix the parser later, not in a panic.
  • A source dumps a hundred items. The worker takes them all and lets the next stage decide which to keep. The ingestor never makes editorial decisions. Its only job is to fetch.
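The "wait a few seconds, try twice more, then give up and log it" behavior for an offline source might look like this. The function name, the retry count, and the wait time are assumptions for the sketch; only the shape matters.

```python
import time
import logging

def fetch_with_retries(fetch, source, tries=3, wait=5):
    """Try the fetch; on failure, wait a few seconds and try again.
    After the last attempt, log it and give up -- the run continues."""
    for attempt in range(tries):
        try:
            return fetch(source)
        except Exception as exc:
            if attempt < tries - 1:
                time.sleep(wait)
            else:
                logging.warning("giving up on %s: %s", source, exc)
    return []  # empty result; the other workers are unaffected
```

Returning an empty list instead of raising is the key design choice: one dead source costs you its items for a day, never the whole digest.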

In plain words

The ingestor is the boring part. It doesn’t read the items, doesn’t judge them, doesn’t decide what’s interesting. Its only job is “go to the URLs in this list, bring back what’s new, put it in this one box.” Three workers run at once so one slow site doesn’t hold up the others. Whatever they bring home goes to the ranker — the next stop.
