Part 7 of 7 · Document pipeline series ~3 min read

Engineering reference: the document pipeline architecture

Same system as the rest of the series, drawn purely for engineers. Service names, resource identifiers, region, Bedrock model IDs, and the actual flow operations — everything you’d need to recreate this in your own AWS account.

Posts 1–6 walk through the system in plain language. This page is the dense version — no softening, just the architecture as you’d sketch it on a whiteboard during a design review.

Full technical architecture: serverless AI document pipeline in ap-southeast-1

[Diagram description] Three external surfaces sit at the top: GitHub (repo and Actions runner, OIDC token requestor), Google Drive (config folder holding rules.json, with changes.watch push notifications), and the external destinations (Google Sheets, QuickBooks/Xero, Slack — the operator’s tools). Everything runs in a single AWS account in ap-southeast-1 (Singapore), with Bedrock reached via the Global CRIS profile, split into five subsystems:

  • Build & Deploy — GitHub Actions exchanges an OIDC token with the IAM OIDC Provider (token.actions.githubusercontent.com), assumes an IAM Role whose trust policy is scoped to repo:owner/repo:ref:main, and runs SAM/CloudFormation to update the doc-pipeline-prod stack.
  • Config Sync — a Lambda Function URL, fn-config-sync, receives Drive changes.watch notifications, validates the rules content, and writes it to S3 doc-pipeline-data/config/rules.json, which the validator and router read.
  • Intake (entry points) — a Lambda Function URL fn-intake handles direct uploads via multipart POST; Amazon SES Inbound receives forwarded emails and writes attachments to S3. Both land in doc-pipeline-data/raw/{date}/.
  • Processing (Step Functions Express) — the S3 PutObject triggers sfn-doc-pipeline, which orchestrates Lambda fn-virus-scan, then Amazon Textract DetectDocumentText for layout (with AnalyzeDocument FORMS+TABLES as a per-type alternative for forms-heavy documents), then Bedrock global.anthropic.claude-haiku-4-5 with Structured Outputs (JSON schema) for structured field extraction, then Lambda fn-validator, which applies rule checks and per-field confidence thresholds.
  • Routing (verdict and dispatch) — the validator returns one of three verdicts: pass, review, or reject. Pass goes straight to Lambda fn-router; review goes to SQS q-review (with paired q-review-dlq) and the fn-review-resolver Function URL, where the operator approves or fixes flagged fields; reject goes to S3 quarantine with an SNS notification. fn-router then fans the structured data out to the configured destinations (Google Sheets API, accounting software API, Slack webhook).

Cross-cutting strip at the bottom: DynamoDB tables tbl-doc-metadata, tbl-review-queue, and tbl-audit log every action; every CloudWatch log group is set to RetentionInDays: 7; SNS topics t-alarms and t-quarantine email the operator on failures; AWS Budgets carries a $15 monthly alarm; and Lambda fn-archive runs on a weekly EventBridge cron (0 3 ? * SUN *) to move old raw documents to the S3 Glacier Instant Retrieval storage class.
Fig 7. Full architecture, ap-southeast-1. White boxes = AWS resources; dashed AWS container; dashed grey boxes = subsystem groupings; dashed grey arrows = config feed to runtime stages.

Read this top-down, then column-by-column

Top row is the three external surfaces. Below it, the AWS account contains five subsystems: Build & Deploy across the top, then Config Sync, then three runtime columns (Intake, Processing, Routing), with a Cross-cutting strip at the bottom. The dashed grey arrows from the Config Sync output to the validator and router show the only cross-subsystem data dependency — both read the latest rules from S3 on every invocation.
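That per-invocation rules read can be sketched as follows — a hedged sketch, not the pipeline’s actual code: the bucket and key come from the diagram, while the ETag short-circuit and the `types` top-level key in rules.json are assumptions.

```python
import json

BUCKET = "doc-pipeline-data"
RULES_KEY = "config/rules.json"

_cache = {"etag": None, "rules": None}


def parse_rules(raw):
    """Check the minimal shape the validator and router rely on.
    The 'types' key (mapping document types to schemas) is assumed."""
    rules = json.loads(raw)
    if "types" not in rules:
        raise ValueError("rules.json must map document types under 'types'")
    return rules


def load_rules(s3=None):
    """Read the latest rules on every invocation. The HEAD/ETag check
    skips the body download when nothing changed — an optimisation
    not shown on the diagram."""
    if s3 is None:
        import boto3  # lazy so the module imports without AWS config
        s3 = boto3.client("s3")
    head = s3.head_object(Bucket=BUCKET, Key=RULES_KEY)
    if head["ETag"] != _cache["etag"]:
        body = s3.get_object(Bucket=BUCKET, Key=RULES_KEY)["Body"].read()
        _cache["rules"] = parse_rules(body)
        _cache["etag"] = head["ETag"]
    return _cache["rules"]
```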

Naming conventions used in the diagram

  • Lambda functions: fn-<purpose> — e.g. fn-intake, fn-virus-scan, fn-validator, fn-router.
  • Step Functions workflow: sfn-doc-pipeline (Express type, since per-document runtimes stay well under 5 minutes).
  • DynamoDB tables: tbl-<name> — tbl-doc-metadata, tbl-review-queue, tbl-audit.
  • SQS queues: q-<name> with paired q-<name>-dlq.
  • SNS topics: t-alarms for general failures, t-quarantine for virus-scan rejections.
  • S3 layout: single bucket doc-pipeline-data with prefixes config/, raw/{date}/, parsed/{date}/, structured/{date}/, quarantine/.
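Under these conventions, the key a document lives at is mechanical per stage. A small illustrative helper — the ISO date format is an assumption, since the diagram only shows {date}:

```python
from datetime import date, datetime, timezone

BUCKET = "doc-pipeline-data"

# Stages that carry a {date} prefix; config/ and quarantine/ do not.
DATED_STAGES = ("raw", "parsed", "structured")


def key_for(stage, doc_id, day=None):
    """Build the S3 key for a document at a given pipeline stage.
    Date format is an assumed YYYY-MM-DD."""
    if stage not in DATED_STAGES:
        raise ValueError(f"unknown dated stage: {stage}")
    if day is None:
        day = datetime.now(timezone.utc).date()
    return f"{stage}/{day.isoformat()}/{doc_id}"
```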

Region and Bedrock model access

Everything runs in ap-southeast-1 (Singapore) for low latency from the Philippines. Bedrock model invocations use the Global cross-Region inference profile (model IDs prefixed with global.) — data at rest stays in Singapore; inference may route to other regions for capacity. Pricing is the same as on-demand Singapore pricing.
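A minimal sketch of the extraction call via the Converse API, using the Global CRIS profile ID from the diagram. The prompt, token limit, and temperature are illustrative assumptions, not the pipeline’s actual values:

```python
def haiku_request(prompt):
    """Build the Converse payload. modelId is the Global cross-Region
    inference profile (the 'global.' prefix) from the diagram."""
    return {
        "modelId": "global.anthropic.claude-haiku-4-5",
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 1024, "temperature": 0},
    }


def extract_fields(prompt):
    """Invoke the model from ap-southeast-1; CRIS may route the
    inference itself to another region for capacity."""
    import boto3  # lazy so the module imports without AWS config
    client = boto3.client("bedrock-runtime", region_name="ap-southeast-1")
    resp = client.converse(**haiku_request(prompt))
    return resp["output"]["message"]["content"][0]["text"]
```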

Bedrock Anthropic models support Structured Outputs on the Converse and InvokeModel APIs (GA Feb 2026) — the pipeline uses a JSON schema to enforce the field shape per document type, eliminating the “the model returned almost-valid JSON” class of bug.
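A per-type schema of the kind rules.json would carry might look like the following. The field names are illustrative, and the checker is a stdlib stand-in for what Structured Outputs enforces server-side — shown here only to make the "almost-valid JSON" failure mode concrete:

```python
import json

RECEIPT_SCHEMA = {  # illustrative; real schemas live in rules.json
    "type": "object",
    "required": ["vendor", "total", "currency", "date"],
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
        "date": {"type": "string"},
    },
    "additionalProperties": False,
}


def matches_schema(payload, schema):
    """Minimal shape check: parseable JSON object, required keys
    present, no extra keys, primitive types as declared."""
    try:
        doc = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(doc, dict):
        return False
    props = schema["properties"]
    if set(doc) - set(props) or set(schema["required"]) - set(doc):
        return False
    kinds = {"string": str, "number": (int, float)}
    return all(isinstance(doc[k], kinds[props[k]["type"]]) for k in doc)
```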

Textract calls are synchronous for single-page documents (DetectDocumentText) and asynchronous for multi-page or richer extraction (StartDocumentAnalysis → GetDocumentAnalysis), with the choice driven by document type in the rules file. Default is DetectDocumentText — cheap and works for most receipts and invoices, with Bedrock doing the heavy lifting on layout interpretation. AnalyzeDocument with FORMS and TABLES features is reserved for forms-heavy documents where the cost (about a nickel per page) is justified by the layout-extraction quality.
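That routing reduces to a small pure function. A sketch under stated assumptions: the `forms_heavy` flag is a hypothetical rules.json key, and for multi-page text-only documents the async pair is StartDocumentTextDetection / GetDocumentTextDetection (StartDocumentAnalysis requires at least one feature type):

```python
def textract_plan(doc_type, page_count, rules):
    """Choose the Textract API call for a document, per the routing
    described above. 'forms_heavy' is an assumed rules.json key."""
    forms_heavy = rules.get(doc_type, {}).get("forms_heavy", False)
    features = ["FORMS", "TABLES"] if forms_heavy else []
    if page_count > 1:
        if forms_heavy:
            # async: StartDocumentAnalysis, poll GetDocumentAnalysis
            return {"api": "StartDocumentAnalysis", "features": features}
        # async text-only pair
        return {"api": "StartDocumentTextDetection", "features": []}
    if forms_heavy:
        # sync per-type alternative for forms-heavy single pages
        return {"api": "AnalyzeDocument", "features": features}
    return {"api": "DetectDocumentText", "features": []}  # the default
```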

What’s deliberately not on the diagram

  • IAM policy details — per-Lambda execution role inline policies are minimal (one bucket prefix, one table, one queue as appropriate).
  • Per-document-type schemas — the rules.json file contains schemas for each type (invoice, receipt, contract, etc.), the field thresholds, and the destination mappings.
  • X-Ray tracing — on for the Step Functions workflow, sampling 100% during ramp-up, 10% in steady state.
  • CloudFormation parameters — the Bedrock model ID and the Textract feature set are template parameters, so swapping models or features doesn’t require code changes.
  • GuardDuty Malware Protection for S3 — an AWS-managed alternative to a custom ClamAV-on-Lambda virus scan. Cheaper to operate at low volume and zero infrastructure to maintain. Worth swapping in when available in your region.
  • Bedrock Nova as a single-stage alternative — amazon.nova-lite-v1:0 and amazon.nova-pro-v1:0 can read PDFs directly via vision and return structured output in one call, skipping Textract entirely. Cheaper at low volume on simpler documents; the Textract+Haiku two-stage shape stays the default for accuracy on forms and tables.
  • Bedrock Batch Inference — for nightly reprocessing or backfilling old documents under a new schema, Bedrock Batch is roughly half the cost of on-demand with a 24-hour SLA.
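For completeness, fn-validator’s three-way verdict from the Routing column also reduces to a pure function. This is a sketch under assumptions: the per-field threshold shape, the 0.8 default, and the "below half the threshold means reject" rule are all illustrative, not the pipeline’s actual policy.

```python
def verdict(fields, thresholds, default=0.8):
    """Return 'pass', 'review', or 'reject' from per-field confidences.
    fields:     {name: {"value": ..., "confidence": float}}
    thresholds: {name: float} minimums, assumed to come from rules.json."""
    if not fields:
        return "reject"  # nothing extracted at all
    flagged = [
        name for name, f in fields.items()
        if f["confidence"] < thresholds.get(name, default)
    ]
    if not flagged:
        return "pass"  # straight to fn-router
    # Assumed policy: a field far below its threshold is unusable,
    # so the document goes to quarantine rather than human review.
    if any(fields[n]["confidence"] < thresholds.get(n, default) / 2
           for n in flagged):
        return "reject"
    return "review"  # SQS q-review, then fn-review-resolver
```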

If you’re recreating this

Start with Build & Deploy alone (a single Lambda, no triggers). Once git push reliably updates an empty stack, add Intake next so you have a place to put your documents. Then the Processing pipeline against a single hard-coded type. Then the Validator with one rule. Then the Router pointing at one destination. Cross-cutting (audit, logs, alarms, budget, archive) goes in from day one.
