Part 7 of 7 · Tax doc collector series ~8 min read

Engineering reference: the tax doc collector architecture

Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, EventBridge Scheduler config, the DynamoDB schemas, and the secure-upload flow. Read alongside the previous six posts; this one’s the build sheet.

Region and account shape

Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock Global cross-Region inference, Textract, and EventBridge Scheduler are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup work at small-practice volume — the failure mode for a practice is a reminder that goes out a day late, not a regional outage. One AWS account dedicated to the collector (separate from your other workloads) keeps the IAM blast radius small, isolates client documents, and lets a single AWS Budgets alarm cover the whole system.

Topology

AWS topology of the tax doc collector A topology diagram with three regions stacked vertically inside one AWS account boundary. Top region: intake. Three boxes show the three setup-and-upload paths — a Drive sheet sync via the drive-sync Lambda triggered every 15 minutes by EventBridge Scheduler that mirrors the checklist CSV to s3://td-clients-source/, an upload Function URL upload-handler that serves the secure upload page, writes the file to s3://td-uploads/, runs Textract, and calls Bedrock Haiku 4.5 to confirm the document type, and an intake-form Function URL intake-form that picks a checklist for a new client and proposes a file for preparer approval. Middle region: scheduled processing. The tracker Lambda is triggered daily at 8am local by EventBridge Scheduler; it reads s3://td-clients-source/clients.csv, iterates rows, computes the still-missing items and days-since-first-request per file, looks up the cadence in s3://td-rules-source/rules.txt, reads send and upload state from DynamoDB, and emits one of three events to the EventBridge default bus per file that needs an action: td.first_request, td.reminder, or td.escalate. Complete files emit a td.complete event instead. Bottom region: dispatch and review. The dispatch Lambda is triggered by an EventBridge rule on the request, reminder, and escalate events; it resolves the contact, checks quiet hours and the holiday calendar, fetches the template from s3://td-rules-source/voice.txt, sends the request via SES outbound with a fresh signed upload link, and writes a row to DynamoDB td-sends. Preparer status-board actions land on a Function URL Lambda action-handler that updates td-state and td-audit with the action (accept, reject-item, reopen) and, on accept or reopen, updates the checklist sheet via the Google Sheets API. CloudWatch Logs collects from every Lambda at 7-day retention. Across the right edge: a small box labelled AWS Budgets alarm at $15 monthly threshold, posting to SNS topic td-cost-alarm. A note at the bottom: a human reviews every file before final — and every interaction is logged to td-audit. Intake Lambda · drive-sync every 15 min Sheets API → s3://td-clients-source/ clients.csv Function URL · upload signed link page → s3://td-uploads/ Textract + Haiku 4.5 confirm doc type Function URL · intake-form new-client form picks checklist for the client type → preparer approval Drive checklist sheet canonical store · mirrored to S3 Scheduled processing EventBridge Scheduler cron(0 8 * * ? *) in TZ_NAME target: tracker Lambda + deferred one-offs Lambda · tracker reads CSV from S3 + rules.txt + voice.txt computes missing, picks one of four moves EventBridge default bus td.first_request td.reminder td.escalate (complete → td.complete) Dispatch & review Lambda · dispatch resolves contact, quiet hours, holidays; SES outbound with a signed upload link Status board per-client view with [Accept] [Reject] [Reopen] → Function URL Lambda · action-handler writes td-state, td-audit, and on accept/reopen updates the Sheet via Sheets API A human reviews every file before final — and every interaction is logged to td-audit.
Fig 7. AWS topology, in three regions of the diagram: intake (setup lanes and the secure upload path into the checklist), scheduled processing (the daily chase tick emitting events), dispatch and review (the request ships and the preparer’s decision is recorded). Every Lambda is event- or schedule-driven; nothing is synchronous-chained.

Lambda functions

All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.

  • drive-sync — EventBridge Scheduler target, fires every 15 minutes. Uses the Google Drive API + Sheets API (service-account credentials in Secrets Manager under td/drive/sa) to export the checklist sheet as CSV and write to s3://td-clients-source/clients.csv only if the sheet has changed since the last sync. Same pattern syncs the rules and voice docs to s3://td-rules-source/. Memory: 256 MB. Timeout: 30 s.
  • upload-handler — Lambda Function URL, public with AuthType: NONE; every request carries a signed, time-limited token (HMAC over client_id + exp, key in Secrets Manager under td/upload/signing-key). On GET, serves the upload page listing the file’s open items. On POST, validates the token, writes the file to s3://td-uploads/<client_id>/<upload_id>, and enqueues the read job. The S3 PUT triggers intake-classify. Memory: 512 MB. Timeout: 30 s.
  • intake-classify — S3 PUT trigger on s3://td-uploads/. Runs Textract via StartDocumentTextDetection + StartDocumentAnalysis (asynchronously to handle multi-page documents). On Textract completion (via SNS notification), reads the structured text and calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0 via global.anthropic.claude-haiku-4-5-20251001-v1:0) to name the best-matching open checklist item with a confidence score. On a confident match, marks the item received in td-state and links the upload in td-uploads; otherwise routes to the preparer’s needs-filing queue. The prompt is bounded to type-confirmation only and never extracts amounts. For DOCX uploads (Textract doesn’t accept them), falls back to python-docx; XLSX uses openpyxl. Both packages are stable and widely used in 2026, though their maintenance velocity is light — for a path that runs a few times per client, that’s acceptable; the community fork python-docx-oss is a drop-in alternative if extraction precision becomes a concern. Memory: 512 MB. Timeout: 60 s.
  • intake-form — Lambda Function URL for the new-client intake form. On submit, reads the answers, builds the right checklist for the client type from the rules doc (including conditional items), and posts a preparer approval card. On approve, writes the new row to the Drive sheet via the Sheets API. Memory: 256 MB. Timeout: 30 s.
  • tracker — EventBridge Scheduler target, daily at 8am local time (the schedule expression runs in TZ_NAME set to the practice’s timezone, e.g. Asia/Singapore). Reads s3://td-clients-source/clients.csv and the rules and voice docs. For each row, computes the still-missing items and days-since-first-request, reads send state from td-sends and item state from td-state, decides on a move. Emits one event per row that needs action: td.first_request, td.reminder, td.escalate, or td.complete, with the file context as the event payload. Healthy in-progress files emit nothing. Memory: 512 MB. Timeout: 60 s. No Bedrock calls.
  • dispatch — EventBridge rule on the request/reminder/escalate events. Resolves contact (per-file email plus any handoff), checks quiet hours and holiday calendar, formats the request from the voice template with only the missing items and a fresh signed upload link, and sends via SES SendRawEmail. On td.complete, notifies the preparer with a status-board link instead. On quiet-hours or holiday defer, creates a one-off EventBridge Scheduler rule that re-invokes dispatch at the next available business minute. Writes a row to td-sends after a successful send. Memory: 256 MB. Timeout: 30 s.
  • action-handler — Lambda Function URL for the status-board actions; authenticated by the preparer’s session cookie (the board is a small authenticated app). Handles Accept, Reject-item, and Reopen. Writes to td-state and td-audit; on accept sets the file done; on reject-item drops one item to waiting and triggers a single-item request; on reopen adds an item and re-enters the cadence. Updates the Drive sheet via the Sheets API. Memory: 256 MB. Timeout: 15 s.
  • digest — EventBridge Scheduler target, weekly Monday 7am. Reads td-state and the checklist; sends the preparer a digest summarizing files complete this week, files stuck, and longest-waiting clients. No Bedrock; a plain summary table. Memory: 256 MB.
  • summary — EventBridge Scheduler target, monthly on the first Monday at 9am. Reads the past month’s td-sends, td-state, and td-audit; calls Bedrock Haiku 4.5 to write a one-paragraph practice narrative; emails it via SES to the configured partner list. Memory: 512 MB.

Storage

  • DynamoDB · td-state — one row per checklist item per file. PK (client_id, item_id); attributes: status (waiting/received/accepted/rejected), upload_id, confirmed_type, confidence, reviewed_by. On-demand.
  • DynamoDB · td-sends — one row per dispatch. PK (client_id, step); attributes: sent_date, move (first_request/reminder/escalate), recipient, missing_count. On-demand. No TTL.
  • DynamoDB · td-uploads — one row per uploaded file. PK (client_id, upload_id); attributes: s3_key, matched_item, confidence, uploaded_at, review_state. On-demand.
  • DynamoDB · td-audit — one row per write action of any kind. PK (client_id, ts); attributes: action, by_user, before, after. On-demand. No TTL — this is the long-term audit trail.
  • S3 · td-clients-source — mirrored CSV from the Drive checklist sheet. Versioning enabled. Lifecycle to Glacier at 90 days; expiry at 7 years.
  • S3 · td-rules-source — mirrored rules and voice docs as plain text. Versioning enabled.
  • S3 · td-uploads — client-uploaded documents. Block all public access; SSE encryption; versioning enabled; lifecycle to Glacier at 180 days; expiry at 7 years. Access only via short-lived presigned URLs from the status board.
  • S3 · td-archive — prior-season files and documents, kept for reference when a returning client’s file is copied forward.

Bedrock

  • Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. Two callsites: intake-classify for the per-upload type-confirmation, and summary for the monthly practice narrative. Claude Sonnet 4.6 (global.anthropic.claude-sonnet-4-6-20250930-v1:0) is available as a fallback for uploads the Haiku pass flags as low-confidence, but in practice tax documents are recognizable enough that Haiku handles them, and a low-confidence upload routes to a human anyway.
  • Embeddings. Not used. The checklist is structured rows; deterministic lookup beats vector retrieval here. No Knowledge Base, no S3 Vectors.
  • Quotas. Default account quotas are more than enough at small-practice volume. The tracker itself doesn’t call Bedrock; the classify lane fires once per uploaded document.

EventBridge Scheduler config

  • td-daily-tickcron(0 8 * * ? *) in the practice’s timezone. Target: tracker Lambda.
  • td-drive-syncrate(15 minutes). Target: drive-sync Lambda.
  • td-weekly-digestcron(0 7 ? * MON *) in TZ. Target: digest Lambda.
  • td-monthly-summarycron(0 9 ? * 2#1 *) (first Monday at 9am) in TZ. Target: summary Lambda.
  • One-off rules — created on the fly by dispatch when a quiet-hours or holiday defer is needed. Use at(YYYY-MM-DDTHH:MM:SS) expressions with --action-after-completion DELETE so the rule self-cleans.

SES inbound and outbound

  • Set the MX record on a dedicated subdomain (e.g. docs.your-practice.com) to inbound-smtp.ap-southeast-1.amazonaws.com if you want clients to be able to reply or forward documents by email.
  • SES inbound rule set td-inbound-rules: one rule with recipient docs@your-practice.com → spam scan → S3 PUT to s3://td-uploads/inbound/<message-id> → stop. The S3 PUT triggers intake-classify via the same path as an upload.
  • SES outbound for the requests and reminders: verify a sender identity at docs@your-practice.com with DKIM and SPF on the parent domain. Out of sandbox by request.

IAM (least privilege per Lambda)

Each Lambda has its own role with policies scoped to exact ARNs. Sketch:

  • tracker role: s3:GetObject on the clients, rules, and voice keys; dynamodb:Query + GetItem on td-sends, td-state; events:PutEvents on the default bus. No bedrock:*.
  • dispatch role: scheduler:CreateSchedule for the deferred-send one-offs; secretsmanager:GetSecretValue on the upload signing key; ses:SendRawEmail from the verified sender identity; dynamodb:PutItem on td-sends.
  • upload-handler role: s3:PutObject on td-uploads; secretsmanager:GetSecretValue on the upload signing key; dynamodb:PutItem on td-uploads.
  • intake-classify role: s3:GetObject on td-uploads; textract:StartDocumentTextDetection + StartDocumentAnalysis; bedrock:InvokeModel on the Haiku ARN; dynamodb:PutItem on td-state and td-uploads.
  • action-handler role: dynamodb:PutItem on td-state and td-audit; secretsmanager:GetSecretValue on the Sheets-API service-account secret; outbound network to sheets.googleapis.com; s3:GetObject on td-uploads for presigned review links.
  • drive-sync role: secretsmanager:GetSecretValue on the Google service-account secret; s3:PutObject on the clients and rules buckets; outbound network to www.googleapis.com.

Secure upload and review flow

Upload links are signed tokens, not session cookies: an HMAC over client_id, file_set, and an expiry, signed with the key in td/upload/signing-key. upload-handler verifies the signature and expiry on every request; an expired link renders a “request a fresh link” page that triggers a new td.reminder. The bucket is fully private; the preparer’s status board generates short-lived presigned GET URLs on demand to render thumbnails and previews, so document bytes are never served from a durable public URL.

The status board itself is a small authenticated app (the practice’s staff log in); its action buttons post to action-handler with the preparer’s session. Client-facing surfaces (upload page, intake form) are unauthenticated but token-gated; staff-facing surfaces (status board, actions) require login. That split keeps clients out of each other’s files without making them manage a password.

Observability and cost gates

  • CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on "error" + "throttle" + "timeout" to a CloudWatch metric for alerting.
  • Alarms: tracker Lambda failures > 0 in a day (the daily tick has to run); intake-classify failure rate > 1% in 24h; upload-handler token-verification failures > 20/hour (might mean a leaked or stale link being retried).
  • X-Ray: off by default. Not worth the cost at small-practice volume.
  • AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic td-cost-alarm subscribed to the on-call partner’s email.

Config and secrets

Service-account credentials for Drive and Sheets APIs live in Secrets Manager under td/drive/sa (one service account with scopes for both APIs). The upload signing key is td/upload/signing-key. SES sender identity lives in IAM and the verified-domain config. The configured timezone, holiday list reference, quiet-hours window, default due date, and the per-client-type checklists all live in Parameter Store under /td/config/ (with the larger checklist templates in the Drive rules doc). Lambdas fetch config on cold start and cache for the lifetime of the execution environment.

Deploy

Whichever IaC you prefer. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning and block-public-access for td-uploads so a client document is never exposed and a re-upload never silently overwrites, and version the EventBridge Scheduler timezone setting so you don’t accidentally start running the daily tick in UTC after a CI rotation. CDK with a Python stack file works well; SAM also fits, and matches the GitHub Actions + OIDC deploy with no long-lived keys. Total deployable surface: around nine Lambdas, four DynamoDB tables, four S3 buckets, one EventBridge rule on the default bus (plus the Scheduler rules), one SES rule set, and one Budgets alarm.

That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your practice, see Work with me.

All posts