A document pipeline on AWS for a few dollars a month
Documents arrive in your business by the dozen: invoices, contracts, receipts, signed forms. Most still get typed into a spreadsheet by hand. Here’s how to build a small pipeline that reads each one for you, checks it, and sends the structured data wherever it needs to go.
The whole system on one page
Before any code, here’s the shape of what we’re building.
What you set up once (the outside)
- A way for documents to arrive — a small upload form, a forwarded email address, or a folder you drop scans into.
- A short rules file — what kinds of documents to expect (invoices, receipts, signed forms), what fields each one should have, and where the clean data should end up.
- The tools you actually use — a Google Sheet, a database, your accounting app. Whatever already runs your business.
What runs quietly in the cloud (the inside)
- The reader — takes a fresh document and pulls out the text, the layout, the table cells, the signatures. Hands the result to the validator.
- The validator — checks the extraction against the rules. Confident reads continue. Anything fishy goes to a short review queue you can clear in seconds.
- The router — takes the clean structured data and sends it where it belongs.
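The hand-off between the three pieces can be sketched in a few lines of plain Python. This is a sketch under assumptions, not the pipeline's actual code: the `Extraction` and `Pipeline` classes, the 0.90 threshold, and the field names are all hypothetical, and the reader is stubbed out (an `Extraction` stands in for whatever the reader hands over).

```python
from dataclasses import dataclass, field

# Hypothetical threshold: reads below this confidence go to human review.
CONFIDENCE_THRESHOLD = 0.90


@dataclass
class Extraction:
    """What the reader is assumed to hand to the validator."""
    doc_type: str
    fields: dict      # field name -> extracted value
    confidence: dict  # field name -> score between 0 and 1


@dataclass
class Pipeline:
    rules: dict  # doc type -> {"required_fields": [...], "destination": "..."}
    review_queue: list = field(default_factory=list)
    delivered: list = field(default_factory=list)

    def validate(self, ex):
        """Return a list of problems; an empty list means a confident read."""
        rule = self.rules.get(ex.doc_type)
        if rule is None:
            return [f"unknown document type: {ex.doc_type}"]
        problems = []
        for name in rule["required_fields"]:
            if name not in ex.fields:
                problems.append(f"missing field: {name}")
            elif ex.confidence.get(name, 0.0) < CONFIDENCE_THRESHOLD:
                problems.append(f"low confidence: {name}")
        return problems

    def route(self, ex):
        """Record the delivery; a real router would call the destination's API."""
        destination = self.rules[ex.doc_type]["destination"]
        self.delivered.append((destination, ex.fields))

    def handle(self, ex):
        """Validate one extraction, then either route it or queue it for review."""
        problems = self.validate(ex)
        if problems:
            self.review_queue.append((ex, problems))
            return "review"
        self.route(ex)
        return "routed"


pipeline = Pipeline(rules={
    "receipt": {"required_fields": ["merchant", "total"],
                "destination": "google_sheet"},
})
good = Extraction("receipt", {"merchant": "Acme", "total": "12.50"},
                  {"merchant": 0.99, "total": 0.97})
shaky = Extraction("receipt", {"merchant": "Acme", "total": "12.50"},
                   {"merchant": 0.99, "total": 0.41})
results = [pipeline.handle(good), pipeline.handle(shaky)]
```

The point of the shape is that a shaky read is never dropped and never guessed at: it lands in the review queue with the reasons attached, so a person can clear it quickly.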
In plain words
Documents arrive however your team already gets them. The cloud reads each one, checks it, and sends the structured data straight to the tools you already use. The system sleeps when there’s nothing to read.
Total cost runs a few dollars a month, not a few hundred.
Design rules that shaped every decision
- Stay inside the AWS always-free quotas wherever possible.
- No always-on server. No NAT Gateway. No infinite log retention.
- The AI is only one step. Cheap rules and validations do most of the gating.
- A human always has the final say on low-confidence extractions — the system never quietly invents fields.
- Configuration lives somewhere a non-engineer can edit — updating rules never needs a deploy.
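As one illustration of cheap rules doing the gating, here is a sketch of a pre-check that could run before any document reaches the AI step. The allowed extensions, the size limit, and the `precheck` function are assumptions made up for this example, and the later posts may gate on different things.

```python
import hashlib

# Hypothetical limits, chosen for illustration only.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}
MAX_BYTES = 10 * 1024 * 1024  # 10 MiB


def precheck(filename, data, seen_hashes):
    """Return (ok, reason). Only files that pass should reach the AI reader."""
    dot = filename.rfind(".")
    extension = filename[dot:].lower() if dot != -1 else ""
    if extension not in ALLOWED_EXTENSIONS:
        return False, "unsupported file type"
    if len(data) == 0:
        return False, "empty file"
    if len(data) > MAX_BYTES:
        return False, "file too large"
    # Skip documents already processed, identified by content hash.
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        return False, "duplicate document"
    seen_hashes.add(digest)
    return True, "ok"
```

Each check costs close to nothing, and every document it rejects is one the paid extraction step never has to see.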
Why this shape
Most “document AI” tools collapse under one of three weights: a server bill that climbs every month, an extraction that’s wrong just often enough to be unusable, or a workflow no one outside the dev team can change.
The architecture above is the smallest set of moving parts I could find that avoids all three at once. One way in (your documents), one way out (your tools), and a reader-validator-router trio in the middle that knows when to ask for help. Everything else is plumbing.
The next five posts walk through each piece in turn: how documents enter, how the AI reads them, how extractions stay accurate, how the data flows out, and what the whole thing actually costs. One diagram per post. A final engineering reference gives the dense version, with precise service names and model IDs.