Engineering reference: the document pipeline architecture
Same system as the rest of the series, drawn purely for engineers. Service names, resource identifiers, region, Bedrock model IDs, and the actual flow operations — everything you’d need to recreate this in your own AWS account.
Key takeaways
- Single AWS account, region
ap-southeast-1, Bedrock invoked via Global cross-Region inference. - Step Functions Express orchestrates virus scan, Textract, Bedrock Haiku 4.5, and the validator.
- Bedrock Structured Outputs (JSON schema) eliminate the “almost-valid JSON” class of bug.
- Three intake lanes (Function URL, SES inbound, S3 PutObject) all converge on one S3 prefix.
- CloudWatch logs at 7-day retention, $15 monthly budget alarm, weekly archive to Glacier IR.
Posts 1–6 walk through the system in plain language. This page is the dense version — no softening, just the architecture as you’d sketch it on a whiteboard during a design review.
Read this top-down, then column-by-column
Top row is the three external surfaces. Below it, the AWS account contains five subsystems: Build & Deploy across the top, then Config Sync, then three runtime columns (Intake, Processing, Routing), with a Cross-cutting strip at the bottom. The dashed grey arrows from the Config Sync output to the validator and router show the only cross-subsystem data dependency — both read the latest rules from S3 on every invocation.
Naming conventions used in the diagram
- Lambda functions:
fn-<purpose>— e.g.fn-intake,fn-virus-scan,fn-validator,fn-router. - Step Functions workflow:
sfn-doc-pipeline(Express type, since per-document runtimes stay well under 5 minutes). - DynamoDB tables:
tbl-<name>—tbl-doc-metadata,tbl-review-queue,tbl-audit. - SQS queues:
q-<name>with pairedq-<name>-dlq. - SNS topics:
t-alarmsfor general failures,t-quarantinefor virus-scan rejections. - S3 layout: single bucket
doc-pipeline-datawith prefixesconfig/,raw/{date}/,parsed/{date}/,structured/{date}/,quarantine/.
Region and Bedrock model access
Everything runs in ap-southeast-1 (Singapore) for low latency from the Philippines. Bedrock model invocations use the Global cross-Region inference profile (model IDs prefixed with global.) — data at rest stays in Singapore; inference may route to other regions for capacity. Pricing is the same as on-demand Singapore pricing.
Bedrock Anthropic models support Structured Outputs on the Converse and InvokeModel APIs (GA Feb 2026) — the pipeline uses a JSON schema to enforce the field shape per document type, eliminating the “the model returned almost-valid JSON” class of bug.
Textract calls are synchronous for single-page documents (DetectDocumentText) and asynchronous for multi-page or richer extraction (StartDocumentAnalysis → GetDocumentAnalysis), with the choice driven by document type in the rules file. Default is DetectDocumentText — cheap and works for most receipts and invoices, with Bedrock doing the heavy lifting on layout interpretation. AnalyzeDocument with FORMS and TABLES features is reserved for forms-heavy documents where the cost (about a nickel per page) is justified by the layout-extraction quality.
What’s deliberately not on the diagram
- IAM policy details — per-Lambda execution role inline policies are minimal (one bucket prefix, one table, one queue as appropriate).
- Per-document-type schemas — the
rules.jsonfile contains schemas for each type (invoice, receipt, contract, etc.), the field thresholds, and the destination mappings. - X-Ray tracing — on for the Step Functions workflow, sampling 100% during ramp-up, 10% in steady state.
- The CloudFormation parameter for the Bedrock model ID and the Textract feature is templated, so swapping models or features doesn’t require code changes.
- GuardDuty Malware Protection for S3 — an AWS-managed alternative to a custom ClamAV-on-Lambda virus scan. Cheaper to operate at low volume and zero infrastructure to maintain. Worth swapping in when available in your region.
- Bedrock Nova as a single-stage alternative —
amazon.nova-lite-v1:0andamazon.nova-pro-v1:0can read PDFs directly via vision and return structured output in one call, skipping Textract entirely. Cheaper at low volume on simpler documents; the Textract+Haiku two-stage shape stays the default for accuracy on forms and tables. - Bedrock Batch Inference — for nightly reprocessing or backfilling old documents under a new schema, Bedrock Batch is roughly half the cost of on-demand with a 24-hour SLA.
If you’re recreating this
Start with Build & Deploy alone (a single Lambda, no triggers). Once git push reliably updates an empty stack, add Intake next so you have a place to put your documents. Then the Processing pipeline against a single hard-coded type. Then the Validator with one rule. Then the Router pointing at one destination. Cross-cutting (audit, logs, alarms, budget, archive) goes in from day one.