Engineering reference: the tax doc collector architecture
Same system, drawn for engineers. Region, service names, resource identifiers, Bedrock model IDs, Lambda inventory, IAM scopes, the SES inbound rule set, EventBridge Scheduler config, the DynamoDB schemas, and the secure-upload flow. Read alongside the previous six posts; this one’s the build sheet.
Region and account shape
Default region: ap-southeast-1 (Singapore). SES inbound, Bedrock Global cross-Region inference, Textract, and EventBridge Scheduler are all in good shape there. A second region for multi-region resilience isn’t worth the extra setup work at small-practice volume — the failure mode for a practice is a reminder that goes out a day late, not a regional outage. One AWS account dedicated to the collector (separate from your other workloads) keeps the IAM blast radius small, isolates client documents, and lets a single AWS Budgets alarm cover the whole system.
Topology
Lambda functions
All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.
drive-sync— EventBridge Scheduler target, fires every 15 minutes. Uses the Google Drive API + Sheets API (service-account credentials in Secrets Manager undertd/drive/sa) to export the checklist sheet as CSV and write tos3://td-clients-source/clients.csvonly if the sheet has changed since the last sync. Same pattern syncs the rules and voice docs tos3://td-rules-source/. Memory: 256 MB. Timeout: 30 s.upload-handler— Lambda Function URL, public withAuthType: NONE; every request carries a signed, time-limited token (HMAC overclient_id+exp, key in Secrets Manager undertd/upload/signing-key). On GET, serves the upload page listing the file’s open items. On POST, validates the token, writes the file tos3://td-uploads/<client_id>/<upload_id>, and enqueues the read job. The S3 PUT triggersintake-classify. Memory: 512 MB. Timeout: 30 s.intake-classify— S3 PUT trigger ons3://td-uploads/. Runs Textract viaStartDocumentTextDetection+StartDocumentAnalysis(asynchronously to handle multi-page documents). On Textract completion (via SNS notification), reads the structured text and calls Bedrock Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0viaglobal.anthropic.claude-haiku-4-5-20251001-v1:0) to name the best-matching open checklist item with a confidence score. On a confident match, marks the item received intd-stateand links the upload intd-uploads; otherwise routes to the preparer’s needs-filing queue. The prompt is bounded to type-confirmation only and never extracts amounts. For DOCX uploads (Textract doesn’t accept them), falls back topython-docx; XLSX usesopenpyxl. Both packages are stable and widely used in 2026, though their maintenance velocity is light — for a path that runs a few times per client, that’s acceptable; the community forkpython-docx-ossis a drop-in alternative if extraction precision becomes a concern. Memory: 512 MB. Timeout: 60 s.intake-form— Lambda Function URL for the new-client intake form. On submit, reads the answers, builds the right checklist for the client type from the rules doc (including conditional items), and posts a preparer approval card. On approve, writes the new row to the Drive sheet via the Sheets API. Memory: 256 MB. Timeout: 30 s.tracker— EventBridge Scheduler target, daily at 8am local time (the schedule expression runs inTZ_NAMEset to the practice’s timezone, e.g.Asia/Singapore). Readss3://td-clients-source/clients.csvand the rules and voice docs. For each row, computes the still-missing items and days-since-first-request, reads send state fromtd-sendsand item state fromtd-state, decides on a move. Emits one event per row that needs action:td.first_request,td.reminder,td.escalate, ortd.complete, with the file context as the event payload. Healthy in-progress files emit nothing. Memory: 512 MB. Timeout: 60 s. No Bedrock calls.dispatch— EventBridge rule on the request/reminder/escalate events. Resolves contact (per-file email plus any handoff), checks quiet hours and holiday calendar, formats the request from the voice template with only the missing items and a fresh signed upload link, and sends via SESSendRawEmail. Ontd.complete, notifies the preparer with a status-board link instead. On quiet-hours or holiday defer, creates a one-off EventBridge Scheduler rule that re-invokesdispatchat the next available business minute. Writes a row totd-sendsafter a successful send. Memory: 256 MB. Timeout: 30 s.action-handler— Lambda Function URL for the status-board actions; authenticated by the preparer’s session cookie (the board is a small authenticated app). Handles Accept, Reject-item, and Reopen. Writes totd-stateandtd-audit; on accept sets the file done; on reject-item drops one item to waiting and triggers a single-item request; on reopen adds an item and re-enters the cadence. Updates the Drive sheet via the Sheets API. Memory: 256 MB. Timeout: 15 s.digest— EventBridge Scheduler target, weekly Monday 7am. Readstd-stateand the checklist; sends the preparer a digest summarizing files complete this week, files stuck, and longest-waiting clients. No Bedrock; a plain summary table. Memory: 256 MB.summary— EventBridge Scheduler target, monthly on the first Monday at 9am. Reads the past month’std-sends,td-state, andtd-audit; calls Bedrock Haiku 4.5 to write a one-paragraph practice narrative; emails it via SES to the configured partner list. Memory: 512 MB.
Storage
- DynamoDB ·
td-state— one row per checklist item per file. PK(client_id, item_id); attributes:status(waiting/received/accepted/rejected),upload_id,confirmed_type,confidence,reviewed_by. On-demand. - DynamoDB ·
td-sends— one row per dispatch. PK(client_id, step); attributes:sent_date,move(first_request/reminder/escalate),recipient,missing_count. On-demand. No TTL. - DynamoDB ·
td-uploads— one row per uploaded file. PK(client_id, upload_id); attributes:s3_key,matched_item,confidence,uploaded_at,review_state. On-demand. - DynamoDB ·
td-audit— one row per write action of any kind. PK(client_id, ts); attributes:action,by_user,before,after. On-demand. No TTL — this is the long-term audit trail. - S3 ·
td-clients-source— mirrored CSV from the Drive checklist sheet. Versioning enabled. Lifecycle to Glacier at 90 days; expiry at 7 years. - S3 ·
td-rules-source— mirrored rules and voice docs as plain text. Versioning enabled. - S3 ·
td-uploads— client-uploaded documents. Block all public access; SSE encryption; versioning enabled; lifecycle to Glacier at 180 days; expiry at 7 years. Access only via short-lived presigned URLs from the status board. - S3 ·
td-archive— prior-season files and documents, kept for reference when a returning client’s file is copied forward.
Bedrock
- Foundation model.
anthropic.claude-haiku-4-5-20251001-v1:0via the Global cross-Region inference profileglobal.anthropic.claude-haiku-4-5-20251001-v1:0. Two callsites:intake-classifyfor the per-upload type-confirmation, andsummaryfor the monthly practice narrative. Claude Sonnet 4.6 (global.anthropic.claude-sonnet-4-6-20250930-v1:0) is available as a fallback for uploads the Haiku pass flags as low-confidence, but in practice tax documents are recognizable enough that Haiku handles them, and a low-confidence upload routes to a human anyway. - Embeddings. Not used. The checklist is structured rows; deterministic lookup beats vector retrieval here. No Knowledge Base, no S3 Vectors.
- Quotas. Default account quotas are more than enough at small-practice volume. The tracker itself doesn’t call Bedrock; the classify lane fires once per uploaded document.
EventBridge Scheduler config
td-daily-tick—cron(0 8 * * ? *)in the practice’s timezone. Target:trackerLambda.td-drive-sync—rate(15 minutes). Target:drive-syncLambda.td-weekly-digest—cron(0 7 ? * MON *)in TZ. Target:digestLambda.td-monthly-summary—cron(0 9 ? * 2#1 *)(first Monday at 9am) in TZ. Target:summaryLambda.- One-off rules — created on the fly by
dispatchwhen a quiet-hours or holiday defer is needed. Useat(YYYY-MM-DDTHH:MM:SS)expressions with--action-after-completion DELETEso the rule self-cleans.
SES inbound and outbound
- Set the MX record on a dedicated subdomain (e.g.
docs.your-practice.com) toinbound-smtp.ap-southeast-1.amazonaws.comif you want clients to be able to reply or forward documents by email. - SES inbound rule set
td-inbound-rules: one rule with recipientdocs@your-practice.com→ spam scan → S3 PUT tos3://td-uploads/inbound/<message-id>→ stop. The S3 PUT triggersintake-classifyvia the same path as an upload. - SES outbound for the requests and reminders: verify a sender identity at
docs@your-practice.comwith DKIM and SPF on the parent domain. Out of sandbox by request.
IAM (least privilege per Lambda)
Each Lambda has its own role with policies scoped to exact ARNs. Sketch:
- tracker role:
s3:GetObjecton the clients, rules, and voice keys;dynamodb:Query+GetItemontd-sends,td-state;events:PutEventson the default bus. Nobedrock:*. - dispatch role:
scheduler:CreateSchedulefor the deferred-send one-offs;secretsmanager:GetSecretValueon the upload signing key;ses:SendRawEmailfrom the verified sender identity;dynamodb:PutItemontd-sends. - upload-handler role:
s3:PutObjectontd-uploads;secretsmanager:GetSecretValueon the upload signing key;dynamodb:PutItemontd-uploads. - intake-classify role:
s3:GetObjectontd-uploads;textract:StartDocumentTextDetection+StartDocumentAnalysis;bedrock:InvokeModelon the Haiku ARN;dynamodb:PutItemontd-stateandtd-uploads. - action-handler role:
dynamodb:PutItemontd-stateandtd-audit;secretsmanager:GetSecretValueon the Sheets-API service-account secret; outbound network tosheets.googleapis.com;s3:GetObjectontd-uploadsfor presigned review links. - drive-sync role:
secretsmanager:GetSecretValueon the Google service-account secret;s3:PutObjecton the clients and rules buckets; outbound network towww.googleapis.com.
Secure upload and review flow
Upload links are signed tokens, not session cookies: an HMAC over client_id, file_set, and an expiry, signed with the key in td/upload/signing-key. upload-handler verifies the signature and expiry on every request; an expired link renders a “request a fresh link” page that triggers a new td.reminder. The bucket is fully private; the preparer’s status board generates short-lived presigned GET URLs on demand to render thumbnails and previews, so document bytes are never served from a durable public URL.
The status board itself is a small authenticated app (the practice’s staff log in); its action buttons post to action-handler with the preparer’s session. Client-facing surfaces (upload page, intake form) are unauthenticated but token-gated; staff-facing surfaces (status board, actions) require login. That split keeps clients out of each other’s files without making them manage a password.
Observability and cost gates
- CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on
"error"+"throttle"+"timeout"to a CloudWatch metric for alerting. - Alarms: tracker Lambda failures > 0 in a day (the daily tick has to run); intake-classify failure rate > 1% in 24h; upload-handler token-verification failures > 20/hour (might mean a leaked or stale link being retried).
- X-Ray: off by default. Not worth the cost at small-practice volume.
- AWS Budgets: $15/month threshold, alarm at 80% and 100%, posts to SNS topic
td-cost-alarmsubscribed to the on-call partner’s email.
Config and secrets
Service-account credentials for Drive and Sheets APIs live in Secrets Manager under td/drive/sa (one service account with scopes for both APIs). The upload signing key is td/upload/signing-key. SES sender identity lives in IAM and the verified-domain config. The configured timezone, holiday list reference, quiet-hours window, default due date, and the per-client-type checklists all live in Parameter Store under /td/config/ (with the larger checklist templates in the Drive rules doc). Lambdas fetch config on cold start and cache for the lifetime of the execution environment.
Deploy
Whichever IaC you prefer. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning and block-public-access for td-uploads so a client document is never exposed and a re-upload never silently overwrites, and version the EventBridge Scheduler timezone setting so you don’t accidentally start running the daily tick in UTC after a CI rotation. CDK with a Python stack file works well; SAM also fits, and matches the GitHub Actions + OIDC deploy with no long-lived keys. Total deployable surface: around nine Lambdas, four DynamoDB tables, four S3 buckets, one EventBridge rule on the default bus (plus the Scheduler rules), one SES rule set, and one Budgets alarm.
That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your practice, see Work with me.
All posts