Series · 7 parts Published April 28, 2026

Document pipeline

A serverless pipeline on AWS that reads documents for you, checks each extraction, and sends the structured data to whichever tools you already use. Seven posts on the same system — one diagram at a time — with an engineering reference at the end.

  1. 01

    A document pipeline on AWS for a few dollars a month

    The whole system on one page — a reader, a validator, a router, and a small rules file that controls where everything goes.

  2. 02

    How a document enters the pipeline

    Three ways in — uploaded, emailed, or dropped into a folder. One shared place after intake, screened for trouble before any AI runs.

  3. 03

    How the AI reads a document

    Two AIs in sequence — one specialist for layout, one generalist for meaning. Together they read a document better, faster, and cheaper than either alone.

  4. 04

    How extraction stays accurate

    The validator passes the clear cases straight through and queues the uncertain ones for a quick human review. Wrong values never quietly slip into your tools.

  5. 05

    How the data flows out

    Each destination is a small pluggable piece. The rules file decides which document type lands where — sheet, accounting software, Slack.

  6. 06

    What the document pipeline costs

    A coffee a month at typical SMB volume. Cents per document, not dollars. Where the bill actually comes from.

  7. 07

    Engineering reference: the document pipeline architecture

    Same system, drawn purely for engineers. Service names, resource identifiers, region, Bedrock model IDs, and the actual flow operations.

What does the document pipeline do?
It catches every document that arrives in your business — uploads, emailed attachments, scans dropped in a folder — reads it with AI, validates the extraction against your rules, and sends the structured data straight to whichever tools you already use (a Google Sheet, accounting software like QuickBooks or Xero, or a Slack channel for notifications). Most documents flow straight through; a human only sees the ones the validator wasn’t sure about.
How much does it cost to run?
About $1–$5/month at typical SMB volume of around 100 documents per month. Lambda runs, Step Functions Express orchestration, webhook URLs, queues, alerts, and small DynamoDB tables all sit under the AWS perpetual free tier. The only meaningful spend is page reading via Textract (pennies per page) and AI structuring via Bedrock (about a cent per document). At a thousand documents a month it’s still cents per document, not dollars.
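The arithmetic behind that estimate can be sketched as below. This is a rough model that counts only the two metered services named above. The per-page and per-document rates are illustrative placeholders, not quoted AWS prices, and the two-pages-per-document figure is also an assumption.

```python
# Placeholder rates for illustration only; check current AWS pricing for your region.
ASSUMED_TEXTRACT_PER_PAGE = 0.015  # "pennies per page" (assumed value)
ASSUMED_BEDROCK_PER_DOC = 0.01     # "about a cent per document"
ASSUMED_PAGES_PER_DOC = 2          # hypothetical average document length


def monthly_cost(docs, pages_per_doc, textract_per_page, bedrock_per_doc):
    """Estimate monthly spend in dollars from page reading plus AI structuring."""
    textract = docs * pages_per_doc * textract_per_page
    bedrock = docs * bedrock_per_doc
    return round(textract + bedrock, 2)


def cost_per_document(docs, pages_per_doc, textract_per_page, bedrock_per_doc):
    """Average cost of one document, in dollars."""
    total = monthly_cost(docs, pages_per_doc, textract_per_page, bedrock_per_doc)
    return round(total / docs, 4)


if __name__ == "__main__":
    for volume in (100, 1000):
        total = monthly_cost(
            volume,
            ASSUMED_PAGES_PER_DOC,
            ASSUMED_TEXTRACT_PER_PAGE,
            ASSUMED_BEDROCK_PER_DOC,
        )
        print(f"{volume} documents/month: about ${total:.2f}")
```

Because everything else is assumed to stay inside the free tier, the total scales linearly with volume, which is why the per-document figure does not move between 100 and 1,000 documents.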
Which document types does it support?
PDFs, images (PNG, JPG), and scanned documents are all supported out of the box. The pipeline handles invoices, receipts, signed contracts, and forms. The rules file describes what each document type should look like — required fields, value shapes, cross-checks — so adding a new type is a config change, not a code change. Forms-heavy documents that need precise table extraction use Textract AnalyzeDocument with FORMS and TABLES features; lighter documents use the cheaper DetectDocumentText path.
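As a minimal sketch of how a rules entry could select the Textract path, the snippet below builds the request without calling AWS. The `RULES` layout, the `forms_heavy` flag, and the field names are hypothetical; the series does not show the real rules file format. The operation names match the boto3 Textract client methods `analyze_document` and `detect_document_text`.

```python
# Hypothetical rules entries; the real rules file format is not shown in this series.
RULES = {
    "invoice": {
        "required": ["vendor", "invoice_date", "total", "line_items"],
        "forms_heavy": True,   # needs precise table extraction
    },
    "receipt": {
        "required": ["merchant", "date", "total"],
        "forms_heavy": False,  # plain text detection is enough
    },
}


def textract_request(doc_type, bucket, key, rules=RULES):
    """Return the Textract operation name and keyword arguments for a document."""
    document = {"S3Object": {"Bucket": bucket, "Name": key}}
    if rules[doc_type]["forms_heavy"]:
        # Richer, more expensive path: key-value pairs and tables.
        return "analyze_document", {
            "Document": document,
            "FeatureTypes": ["FORMS", "TABLES"],
        }
    # Cheaper path: lines and words only.
    return "detect_document_text", {"Document": document}
```

With a boto3 Textract client, the result could be dispatched as `getattr(client, operation)(**kwargs)`. Adding a document type then means adding one entry to `RULES`, which matches the "config change, not a code change" claim.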
How does it stay accurate?
The validator runs two checks on each document. First, do the rules hold: are required fields present, are values in the right shape, and do cross-checks pass, such as line items adding up to the total? Second, is the AI confident enough on every field? Each field comes out of the reader with its own confidence score, and the threshold is tunable per field type: strict for amounts and dates, looser for free text. Each document gets one of three verdicts. Pass goes straight to the router. Review enters a small queue where an operator approves or fixes the flagged fields next to the original document image, usually in seconds. Reject is held for the operator with a clear note.
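The two checks and three verdicts can be sketched roughly as follows. The threshold values, the `Field` type, and the mapping from failure kind to verdict (missing required fields reject; failed cross-checks and low confidence go to review) are assumptions made for illustration, not the pipeline's documented behavior.

```python
from dataclasses import dataclass


@dataclass
class Field:
    value: object
    confidence: float  # 0.0 to 1.0, as reported by the reader


# Assumed thresholds: strict for amounts and dates, looser for free text.
THRESHOLDS = {"amount": 0.95, "date": 0.95, "text": 0.80}


def line_items_match_total(fields):
    """Cross-check: line item amounts add up to the stated total."""
    return abs(sum(fields["line_items"].value) - fields["total"].value) < 0.005


def validate(fields, field_types, required, cross_checks):
    """Return (verdict, notes) where verdict is 'pass', 'review', or 'reject'."""
    missing = [
        name for name in required
        if name not in fields or fields[name].value in (None, "")
    ]
    if missing:
        # Assumption: a document missing required fields is rejected outright.
        return "reject", [f"missing required field: {name}" for name in missing]

    flags = [
        f"rule failed: {label}"
        for label, check in cross_checks
        if not check(fields)
    ]
    for name, field in fields.items():
        threshold = THRESHOLDS[field_types.get(name, "text")]
        if field.confidence < threshold:
            flags.append(f"low confidence: {name}")

    return ("review", flags) if flags else ("pass", [])
```

A call might look like `validate(fields, types, ["vendor", "total"], [("line items sum to total", line_items_match_total)])`. The notes list is what an operator would see next to the document image in the review queue.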
How does it avoid making things up?
Bedrock Structured Outputs (GA Feb 2026) enforce a JSON schema per document type, so the model can’t return fields that don’t exist in the schema or values in the wrong shape. A schema constrains shape, not correctness, so the validator’s rule checks then verify that required fields are present and cross-checks hold. Any field below its confidence threshold raises a flag and sends the document to the human review queue rather than into your tools.
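To make "a JSON schema per document type" concrete, here is a hedged sketch of what an invoice schema might look like, plus two small local helpers that show what the schema rules out. The field names are hypothetical, and how the schema is handed to Bedrock is deliberately not shown here.

```python
# Hypothetical invoice schema; the real per-type schemas are not shown in this series.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_date": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
                "required": ["description", "amount"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["vendor", "invoice_date", "total", "line_items"],
    # Disallows any field the schema does not name.
    "additionalProperties": False,
}


def unexpected_fields(payload, schema):
    """Top-level keys in the payload that the schema does not define."""
    return sorted(set(payload) - set(schema["properties"]))


def missing_required(payload, schema):
    """Required top-level keys that the payload lacks."""
    return [name for name in schema["required"] if name not in payload]
```

The helpers only inspect top-level keys and stand in for what schema enforcement does at generation time. They are not a JSON Schema validator.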
Where can the data flow out to?
Three common destinations are wired by default: a Google Sheet (one row per processed document, easiest to share with a non-technical colleague), accounting software like QuickBooks or Xero (for invoices and bills, written straight in with no manual re-entry), and a Slack or email notification (for flagged extractions, signed contracts, or invoices over a threshold). The router fans out by document type — receipts to the sheet, invoices to accounting plus a Slack ping above an amount threshold, contracts to a Drive folder. A new destination is roughly twenty lines of code; the rest of the pipeline doesn’t change.
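The fan-out described above could be sketched with a small registry of destination functions. Everything here is illustrative: the registry, the route table, the threshold value, and the stub destinations (which return strings instead of calling Google Sheets, an accounting API, or Slack) are assumptions.

```python
from typing import Callable, Dict, List

Destination = Callable[[dict], str]
DESTINATIONS: Dict[str, Destination] = {}


def destination(name: str):
    """Register a function as a named destination."""
    def register(fn: Destination) -> Destination:
        DESTINATIONS[name] = fn
        return fn
    return register


@destination("sheet")
def to_sheet(doc: dict) -> str:
    return f"sheet row for {doc['id']}"  # stub: would append a spreadsheet row


@destination("accounting")
def to_accounting(doc: dict) -> str:
    return f"accounting entry for {doc['id']}"  # stub: would create a bill


@destination("slack")
def to_slack(doc: dict) -> str:
    return f"slack ping for {doc['id']}"  # stub: would post a message


# Hypothetical routing rules keyed by document type.
ROUTES: Dict[str, List[str]] = {"receipt": ["sheet"], "invoice": ["accounting"]}
SLACK_THRESHOLD = 1000.0  # assumed amount above which invoices also notify Slack


def route(doc: dict) -> List[str]:
    """Send a validated document to every destination configured for its type."""
    targets = list(ROUTES.get(doc["type"], []))
    if doc["type"] == "invoice" and doc.get("total", 0) > SLACK_THRESHOLD:
        targets.append("slack")
    return [DESTINATIONS[name](doc) for name in targets]
```

Under this design, adding a destination means writing one decorated function and naming it in `ROUTES`. The router itself stays unchanged.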
Which AWS services does it use?
Lambda (with Function URLs for the upload endpoint and operator review UI), Step Functions Express (for the per-document orchestration), Amazon Textract (DetectDocumentText by default; AnalyzeDocument with FORMS and TABLES for forms-heavy types), Bedrock with Claude Haiku 4.5 via Global cross-Region inference (using Structured Outputs for JSON schema enforcement), S3 (single bucket with config/, raw/, parsed/, structured/, quarantine/ prefixes), SES Inbound (for email forwards), SQS (for the review queue with a paired DLQ), DynamoDB on-demand (for metadata and audit), SNS (for alarm and quarantine topics), CloudWatch Logs at 7-day retention, and AWS Budgets at $15/month. Region is ap-southeast-1 (Singapore).
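The single-bucket layout can be illustrated with a small key-building helper. The prefixes come from the list above. The key naming scheme (`prefix/document-id.extension`) and the idea that `config/` holds the rules file are assumptions.

```python
# Prefixes named in the service list; config/ is assumed to hold the rules file.
PREFIXES = ("config", "raw", "parsed", "structured", "quarantine")


def stage_key(prefix: str, doc_id: str, extension: str) -> str:
    """Build an S3 object key for a document at a given pipeline stage."""
    if prefix not in PREFIXES:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return f"{prefix}/{doc_id}.{extension}"


def move_to_quarantine(raw_key: str) -> str:
    """Map a raw/ key to its quarantine/ counterpart (hypothetical convention)."""
    if not raw_key.startswith("raw/"):
        raise ValueError("expected a key under raw/")
    return "quarantine/" + raw_key[len("raw/"):]
```

Keeping one document ID across prefixes would make it easy to trace an item from `raw/` through `parsed/` to `structured/` in the audit table.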