How the ingestor walks your sources
Once each morning, a small dispatcher reads your source list, hands each source to a specialist worker, and drops everything new into one shared place. Three workers, one mailbox — the boring part of the system.
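To make that concrete, here is a minimal sketch of one morning's run, assuming each source is a plain record with a `kind` (feed, api, or page) and a `url`. The function names, the source format, and the dummy worker bodies are all illustrative assumptions, not the bot's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_feed(source):
    # Real code would read the site's feed; dummy item for illustration.
    return [{"url": source["url"] + "/item-1", "published": "2024-05-01"}]

def fetch_api(source):
    return []  # one small adapter per public service would live here

def fetch_page(source):
    return []  # article extraction would live here (sketched further down)

WORKERS = {"feed": fetch_feed, "api": fetch_api, "page": fetch_page}

def run_once(sources):
    """One morning: each source goes to its specialist worker, and
    everything they bring back lands in one shared list (the mailbox)."""
    mailbox = []
    with ThreadPoolExecutor(max_workers=3) as pool:
        for items in pool.map(lambda s: WORKERS[s["kind"]](s), sources):
            mailbox.extend(items)
    return mailbox

if __name__ == "__main__":
    print(run_once([{"kind": "feed", "url": "https://example.com/feed"}]))
```

Running the three kinds in a small pool is also what keeps one slow site from holding up the others: each source is fetched independently, and the mailbox just collects whatever comes back.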
Why three workers, not one
Not every source speaks the same language.
- Feeds — many sites publish a tidy stream of headlines and summaries the bot can read directly. The easy case.
- Public services — some sites give you the data through a public service in whatever shape they chose years ago. Needs a small adapter per service.
- Plain web pages — everything else. The bot pulls the article body out of the page and drops the navigation and the cookie banner.
You could write one worker that does all three. But then a single bad change breaks every source at once. Three small workers — one per kind — keep the failure contained. If the page worker chokes on a redesigned site, the other two keep flowing.
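Of the three, the plain-page worker does the most work per item. Here is a rough sketch of just its extraction step, assuming BeautifulSoup is installed; the tag and element choices are guesses about a typical page, not a universal parser:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_article(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop the chrome first: navigation, footers, scripts, sidebars.
    for tag in soup(["nav", "footer", "script", "style", "aside"]):
        tag.decompose()
    # Prefer a semantic <article> element; fall back to the whole body.
    article = soup.find("article") or soup.body
    return article.get_text(separator="\n", strip=True) if article else ""

html = "<html><body><nav>menu</nav><article><p>Hello.</p></article></body></html>"
print(extract_article(html))  # -> "Hello."
```

When a redesigned site breaks those guesses, only this function needs fixing; the feed and service workers never touch it.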
What “new” actually means
Each fetched item gets a short fingerprint — the URL plus the publish date. Before the worker does any further work, it asks: “have I seen this fingerprint before?” If yes, skip. If no, save the fingerprint and pass the item along.
This is how the bot reads the same feed every morning without flooding you with duplicates. It also means that if a source re-serves an old article at the same address with the same publish date, the bot ignores it — same fingerprint. (If the source bumps the publish date, the fingerprint changes, and the item comes through as "new" again.)
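A sketch of that check, assuming the set of fingerprints is kept somewhere that survives between mornings (faked here with an in-memory set). The exact hash is an assumption too; any stable digest of URL plus date would do:

```python
import hashlib

seen: set[str] = set()  # in the real bot this would persist between runs

def fingerprint(item: dict) -> str:
    raw = item["url"] + "|" + item["published"]
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

def is_new(item: dict) -> bool:
    fp = fingerprint(item)
    if fp in seen:
        return False  # seen before: skip
    seen.add(fp)      # first sighting: remember it, pass the item along
    return True
```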
Three things that go wrong (and what happens then)
- A source is offline. The worker waits a few seconds, tries twice more, then gives up and logs it (see the retry sketch after this list). The other workers keep going. Tomorrow it’ll try again from scratch.
- A page changes its layout. The page worker can’t find the article body. Item is logged as “couldn’t parse” and skipped. The digest still ships — you fix the parser later, not in a panic.
- A source dumps a hundred items. The worker takes them all and lets the next stage decide which to keep. The ingestor never makes editorial decisions. Its only job is to fetch.
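The retry rule from the first bullet might look like this. The three-try count follows the prose ("tries twice more"), while the five-second delay is a stand-in for "a few seconds":

```python
import logging
import time

log = logging.getLogger("ingestor")

def fetch_with_retries(fetch, source, tries=3, delay_seconds=5):
    for attempt in range(1, tries + 1):
        try:
            return fetch(source)
        except Exception as exc:  # network error, timeout, server error...
            if attempt == tries:
                log.warning("giving up on %s: %s", source["url"], exc)
                return []  # empty result; tomorrow is a fresh start
            time.sleep(delay_seconds)
```

Giving up returns an empty list rather than raising, so one dead source never stops the run.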
In plain words
The ingestor is the boring part. It doesn’t read the items, doesn’t judge them, doesn’t decide what’s interesting. Its only job is “go to the URLs in this list, bring back what’s new, put it in this one box.” Three workers run at once so one slow site doesn’t hold up the others. Whatever they bring home goes to the ranker — the next stop.