Engineering reference: the document pipeline architecture

Posts 1–6 walk through the system in plain language. This page is the dense version — no softening, just the architecture as you’d sketch it on a whiteboard during a design review.

Fig 7. Full architecture, ap-southeast-1. White boxes = AWS resources; dashed AWS container; dashed grey boxes = subsystem groupings; dashed grey arrows = config feed to runtime stages.

Read this top-down, then column-by-column

Top row is the three external surfaces. Below it, the AWS account contains five subsystems: Build & Deploy across the top, then Config Sync, then three runtime columns (Intake, Processing, Routing), with a Cross-cutting strip at the bottom. The dashed grey arrows from the Config Sync output to the validator and router show the only cross-subsystem data dependency — both read the latest rules from S3 on every invocation.

Naming conventions used in the diagram

Lambda functions: fn-<purpose> — e.g. fn-intake, fn-virus-scan, fn-validator, fn-router.
Step Functions workflow: sfn-doc-pipeline (Express type, since per-document runtimes stay well under 5 minutes).
DynamoDB tables: tbl-<name> — tbl-doc-metadata, tbl-review-queue, tbl-audit.
SQS queues: q-<name> with paired q-<name>-dlq.
SNS topics: t-alarms for general failures, t-quarantine for virus-scan rejections.
S3 layout: single bucket doc-pipeline-data with prefixes config/, raw/{date}/, parsed/{date}/, structured/{date}/, quarantine/.

Region and Bedrock model access

Everything runs in ap-southeast-1 (Singapore) for low latency from the Philippines. Bedrock model invocations use the Global cross-Region inference profile (model IDs prefixed with global.) — data at rest stays in Singapore; inference may route to other regions for capacity. Pricing is the same as on-demand Singapore pricing.

Bedrock Anthropic models support Structured Outputs on the Converse and InvokeModel APIs (GA Feb 2026) — the pipeline uses a JSON schema to enforce the field shape per document type, eliminating the “the model returned almost-valid JSON” class of bug.

Textract calls are synchronous for single-page documents (DetectDocumentText) and asynchronous for multi-page or richer extraction (StartDocumentAnalysis → GetDocumentAnalysis), with the choice driven by document type in the rules file. Default is DetectDocumentText — cheap and works for most receipts and invoices, with Bedrock doing the heavy lifting on layout interpretation. AnalyzeDocument with FORMS and TABLES features is reserved for forms-heavy documents where the cost (about a nickel per page) is justified by the layout-extraction quality.

What’s deliberately not on the diagram

IAM policy details — per-Lambda execution role inline policies are minimal (one bucket prefix, one table, one queue as appropriate).
Per-document-type schemas — the rules.json file contains schemas for each type (invoice, receipt, contract, etc.), the field thresholds, and the destination mappings.
X-Ray tracing — on for the Step Functions workflow, sampling 100% during ramp-up, 10% in steady state.
The CloudFormation parameter for the Bedrock model ID and the Textract feature is templated, so swapping models or features doesn’t require code changes.
GuardDuty Malware Protection for S3 — an AWS-managed alternative to a custom ClamAV-on-Lambda virus scan. Cheaper to operate at low volume and zero infrastructure to maintain. Worth swapping in when available in your region.
Bedrock Nova as a single-stage alternative — amazon.nova-lite-v1:0 and amazon.nova-pro-v1:0 can read PDFs directly via vision and return structured output in one call, skipping Textract entirely. Cheaper at low volume on simpler documents; the Textract+Haiku two-stage shape stays the default for accuracy on forms and tables.
Bedrock Batch Inference — for nightly reprocessing or backfilling old documents under a new schema, Bedrock Batch is roughly half the cost of on-demand with a 24-hour SLA.

If you’re recreating this

Start with Build & Deploy alone (a single Lambda, no triggers). Once git push reliably updates an empty stack, add Intake next so you have a place to put your documents. Then the Processing pipeline against a single hard-coded type. Then the Validator with one rule. Then the Router pointing at one destination. Cross-cutting (audit, logs, alarms, budget, archive) goes in from day one.

All posts