Part 1 of 7 · Document pipeline series · ~5 min read

A document pipeline on AWS for a few dollars a month

Documents arrive in your business by the dozen — invoices, contracts, receipts, signed forms. Most still get typed into a sheet by hand. Here’s how to build a small pipeline that reads each one for you, checks it, and sends the structured data straight to wherever it needs to go.

The whole system on one page

Before any code, here’s the shape of what we’re building.

[Figure: system architecture. Three external surfaces across the top — "Documents" (uploaded files, emailed attachments, dropped scans) on the left, "Rules" (a small file describing what each document type should look like and where it should go) in the middle, "Your tools" (the sheet, database, or app that needs the structured data) on the right. Each connects by an arrow to a box labeled "AWS account". Inside the box, three components flow left to right: the Reader, which extracts text and structure from each document; the Validator, which checks the extraction against the rules and flags anything it isn't sure about; and the Router, which pushes the clean structured data out to your tools. A note at the bottom reads: most documents flow straight through — a human only sees the ones the validator wasn't sure about.]
Fig 1. Three outside surfaces, three pieces inside AWS. Documents in, structured data out.

What you set up once (the outside)

  • A way for documents to arrive — a small upload form, a forwarded email address, or a folder you drop scans into.
  • A short rules file — what kinds of documents to expect (invoices, receipts, signed forms), what fields each one should have, and where the clean data should end up.
  • The tools you actually use — a Google Sheet, a database, your accounting app. Whatever already runs your business.
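To make the rules file concrete, here is one possible shape for it. This is a sketch only — the keys, field names, and destination strings are illustrative, not a fixed schema:

```yaml
# Hypothetical rules file. Each top-level key is a document type;
# "fields" lists what the reader must find, "destination" says where
# the clean data goes. All names here are examples.
invoice:
  fields: [vendor, total, due_date]
  destination: google_sheet:invoices
receipt:
  fields: [merchant, total]
  destination: accounting_app
signed_form:
  fields: [signer_name, date_signed]
  destination: database:forms
```

The point of keeping it this small is the design rule below: a non-engineer can open this file, add a field, and change a destination without a deploy.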

What runs quietly in the cloud (the inside)

  • The reader — takes a fresh document and pulls out the text, the layout, the table cells, the signatures. Hands the result to the validator.
  • The validator — checks the extraction against the rules. Confident reads continue. Anything fishy goes to a short review queue you can clear in seconds.
  • The router — takes the clean structured data and sends it where it belongs.
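The validator's gating step can be sketched in a few lines. This is a minimal illustration, assuming the reader emits each extraction as a dict of `{field: (value, confidence)}`; the rules shape, field names, and the 0.9 threshold are assumptions for the example, not part of the actual system:

```python
# Minimal sketch of the validator's gating logic.
# Assumes extractions arrive as {field: (value, confidence)}.
# RULES mirrors the rules file; names and threshold are illustrative.

RULES = {
    "invoice": {"required": ["vendor", "total", "due_date"]},
    "receipt": {"required": ["merchant", "total"]},
}

CONFIDENCE_THRESHOLD = 0.9  # below this, a human reviews the document


def validate(doc_type, extraction):
    """Return ("route", clean_fields) for confident reads,
    or ("review", reasons) for anything that needs a human."""
    rules = RULES.get(doc_type)
    if rules is None:
        return ("review", [f"unknown document type: {doc_type}"])

    reasons = []
    for field in rules["required"]:
        if field not in extraction:
            reasons.append(f"missing field: {field}")
        else:
            _value, confidence = extraction[field]
            if confidence < CONFIDENCE_THRESHOLD:
                reasons.append(f"low confidence on {field}: {confidence:.2f}")

    if reasons:
        return ("review", reasons)
    return ("route", {f: extraction[f][0] for f in rules["required"]})
```

Confident reads go straight to the router; everything else lands in the review queue with a human-readable reason attached. Notice that the AI never gets the final say — a missing or shaky field always stops the document.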

In plain words

Documents arrive however your team already gets them. The cloud reads each one, checks it, and sends the structured data straight to the tools you already use. The system sleeps when there’s nothing to read.

Total cost runs a few dollars a month, not a few hundred.

Design rules that shaped every decision

  • Stay inside the AWS always-free quotas wherever possible.
  • No always-on server. No NAT Gateway. No infinite log retention.
  • The AI is only one step. Cheap rules and validations do most of the gating.
  • A human always has the final say on low-confidence extractions — the system never quietly invents fields.
  • Configuration lives somewhere a non-engineer can edit — updating rules never needs a deploy.

Why this shape

Most “document AI” tools collapse under one of three weights: a server bill that climbs every month, an extraction that’s wrong just often enough to be unusable, or a workflow no one outside the dev team can change.

The architecture above is the smallest set of moving parts I could find that solves all three at once. One way in (your documents), one way out (your tools), a reader-validator-router pair in the middle that knows when to ask for help. Everything else is plumbing.

The next five posts walk through each piece in turn — how documents enter, how the AI reads them, how extractions stay accurate, how the data flows out, and what the whole thing actually costs. One diagram per post. A final post is a dense reference for engineers, with precise service names and model IDs.
