Part 3 of 7 · Tax doc collector series · ~5 min read

How a document arrives and gets checked

A client clicks the upload link in their request and lands on a secure page. They drop in a PDF of their W-2. From there the collector has to store it safely, read it just enough to recognize what kind of document it is, match it to the right item on that client’s checklist, and check it off — without ever pretending it’s done the preparer’s job. The whole flow is a few small steps, and the one model call answers exactly one narrow question.

Key takeaways

  • The upload page is a Function URL with a signed, time-limited link — no password for the client to remember.
  • Files land in a private, versioned S3 bucket; nothing is public.
  • Textract reads the text; Bedrock Haiku 4.5 answers one question — which checklist item is this?
  • A confident match checks the item off; a low-confidence match goes to the preparer to file by hand.
  • The collector confirms the document type only. It never reads numbers for the return.

The check flow, per upload

[Figure: decision flow per uploaded document. Input: upload on the secure page (client id, file, signed link token). Step 1: store the file in private S3 (versioned, keyed by client and a fresh id). Step 2: token still valid? If no, route to "Hold or refile": the upload waits and the client is sent a fresh link. Step 3: Textract reads the document (text and any tables from the PDF or photo). Step 4: Haiku 4.5 names the most likely checklist item, for example a W-2, with a confidence score; it confirms the document kind only and never reads figures for the return. A low-confidence match also routes to "Hold or refile" for a human to look at. Step 5: on this client's checklist? If yes, route to "Checked off" (received, pending review); if the confident match isn't on the checklist, route to "To preparer". Each terminal box writes a row recording what happened. Note: the collector confirms the kind of document only; a human reviews the contents before anything is marked final.]
Fig 3. The check flow, per upload. Five steps decide whether a document checks off an item, gets held, or goes to the preparer. The model answers one question — which checklist item is this — and never reads figures for the return.

The secure upload page

The request email carries a link like https://<function-url>/u/<token>. The token is signed and time-limited — it carries the client id and an expiry, and it’s signed with a key in Secrets Manager so it can’t be forged or guessed. The client doesn’t need a password; the link is the credential. When they open it, a Function URL Lambda checks the signature and expiry, then serves a plain upload page that lists exactly which items are still missing, with a drop zone for each. They can upload a PDF or a photo from their phone.
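The signing and checking described above can be sketched with the standard library alone. This is a minimal sketch under assumptions, not the series' actual code: the token layout (`<payload>.<signature>`), the use of HMAC-SHA256, and the names `make_token` and `check_token` are all illustrative, and the key is passed in as bytes where a real Lambda would fetch it from Secrets Manager.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional, Tuple


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, so the token fits in a path segment."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    """Restore the stripped padding and decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_token(client_id: str, ttl_seconds: int, key: bytes,
               now: Optional[float] = None) -> str:
    """Return '<payload>.<signature>' carrying the client id and an expiry."""
    issued = time.time() if now is None else now
    payload = json.dumps(
        {"client_id": client_id, "exp": int(issued) + ttl_seconds},
        separators=(",", ":"),
    ).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(signature)}"


def check_token(token: str, key: bytes,
                now: Optional[float] = None) -> Tuple[str, Optional[str]]:
    """Return (status, client_id); status is 'ok', 'expired', or 'invalid'."""
    try:
        payload_part, signature_part = token.split(".")
        payload = _unb64(payload_part)
        signature = _unb64(signature_part)
    except ValueError:  # wrong shape or bad base64
        return "invalid", None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # Constant-time comparison; the payload is only trusted after this check.
    if not hmac.compare_digest(signature, expected):
        return "invalid", None
    claims = json.loads(payload)
    current = time.time() if now is None else now
    if current >= claims["exp"]:
        return "expired", claims["client_id"]
    return "ok", claims["client_id"]
```

Reporting `expired` separately from `invalid` is a deliberate choice in this sketch: only a token whose signature verifies identifies a real client, so only that case should ever trigger a fresh link being emailed.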

If a client tries an old link — say, one from last week that’s now expired — the page shows a friendly “this link has expired, here’s a fresh one” message and emails them a new link. Links expire on purpose: a tax document link that lives forever in an old email is a small risk that’s easy to remove.

Storing the file safely

The uploaded file goes straight into a private S3 bucket, keyed by client and a fresh id: s3://td-uploads/<client_id>/<upload_id>. The bucket blocks all public access, has versioning on (so a re-upload never silently overwrites), and is encrypted at rest. Nothing about a client’s documents is ever served from a public URL; the preparer views them through the status board, which generates short-lived signed links on demand. The S3 PUT triggers the next step.
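As a rough illustration of the write itself, the sketch below builds the `<client_id>/<upload_id>` key and stores the file. It assumes the write happens server-side through a boto3 S3 client passed in as `s3_client`; the function names and the choice of SSE-KMS are assumptions, since the post only says the bucket is encrypted at rest. Blocking public access and versioning are bucket-level settings and are not shown here.

```python
import uuid


def upload_key(client_id: str, upload_id: str) -> str:
    """Build the object key '<client_id>/<upload_id>'."""
    return f"{client_id}/{upload_id}"


def store_upload(s3_client, bucket: str, client_id: str,
                 body: bytes, content_type: str) -> str:
    """Write one uploaded file to the private bucket and return its upload id.

    `s3_client` is assumed to behave like boto3.client("s3").
    """
    upload_id = uuid.uuid4().hex  # fresh id per upload
    s3_client.put_object(
        Bucket=bucket,
        Key=upload_key(client_id, upload_id),
        Body=body,
        ContentType=content_type,
        # Assumption: SSE-KMS. A bucket default encryption rule would also work.
        ServerSideEncryption="aws:kms",
    )
    return upload_id
```

Because the client id in the key comes from the verified token rather than from anything the browser sends, a file cannot be written under another client's prefix.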

Reading just enough to recognize the document

Amazon Textract reads the file — it handles PDF, PNG, JPEG, and TIFF natively, which covers a scan, a phone photo, or a download from a payroll portal. Textract returns the text and any tables. Then a single Bedrock Haiku 4.5 call answers one narrow question: of the items still open on this client’s checklist, which one does this document look like? The prompt is short and bounded: “Here is the text of an uploaded document and a list of the checklist items still open for this client. Return the single best-matching item and a confidence score. If none match, say so. Do not extract any amounts or figures.”
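The two halves of this step can be sketched as pure functions around the service calls, which are left out. The sketch assumes Textract's standard response shape (a `Blocks` list whose `LINE` blocks carry `Text`). The JSON reply format, the `max_chars` cut-off, and the function names are assumptions added so the answer can be parsed; the post quotes only the prose part of the prompt.

```python
import json
from typing import Dict, List, Optional, Tuple

PROMPT = (
    "Here is the text of an uploaded document and a list of the checklist "
    "items still open for this client. Return the single best-matching item "
    "and a confidence score. If none match, say so. "
    "Do not extract any amounts or figures.\n"
    # Assumption: a JSON reply format, added here so the reply is parseable.
    'Reply with JSON only: {"item": <item name or null>, "confidence": <0 to 1>}\n'
)


def textract_lines(response: Dict) -> str:
    """Join the LINE blocks of a Textract response into plain text."""
    return "\n".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    )


def build_prompt(document_text: str, open_items: List[str],
                 max_chars: int = 6000) -> str:
    """Combine the fixed instructions, the open items, and the document text."""
    items = "\n".join(f"- {name}" for name in open_items)
    return (
        f"{PROMPT}\nOpen checklist items:\n{items}\n\n"
        f"Document text:\n{document_text[:max_chars]}"
    )


def parse_reply(reply_text: str) -> Tuple[Optional[str], float]:
    """Return (item, confidence); anything malformed becomes (None, 0.0)."""
    try:
        data = json.loads(reply_text)
        item = data.get("item")
        confidence = float(data.get("confidence", 0.0))
    except (ValueError, TypeError, AttributeError):
        return None, 0.0
    if not isinstance(item, str):
        item = None
    return item, min(1.0, max(0.0, confidence))
```

Treating an unparseable reply as "no match, zero confidence" means a malformed answer ends up with a person instead of checking anything off.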

That last sentence matters. The collector is not reading the wages off a W-2 or the interest off a mortgage statement — that’s the preparer’s job, and getting it wrong silently would be far worse than not doing it at all. The model’s only job is to recognize what kind of document this is so the right box gets checked.

A confident match checks off; everything else goes to a human

If the match is confident and the named item is on this client’s checklist, the collector marks that item “received, pending review” in the file and writes a row to the td-uploads DynamoDB table linking the upload to the item. The client’s next reminder, if there is one, will no longer list that document. The item is not yet accepted; that happens only when the preparer reviews it, which Part 5 covers.
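The row that links an upload to a checklist item might look like the sketch below. The attribute names, the key layout, and the status string are hypothetical, since the post does not give the td-uploads schema. The result is a plain dict of the kind that could be passed to a boto3 DynamoDB `Table.put_item(Item=...)` call.

```python
from datetime import datetime, timezone
from decimal import Decimal
from typing import Dict, Optional


def upload_row(client_id: str, upload_id: str, item: str, confidence: float,
               now: Optional[datetime] = None) -> Dict:
    """Build the row linking one upload to one checklist item."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "client_id": client_id,    # hypothetical partition key
        "upload_id": upload_id,    # hypothetical sort key
        "checklist_item": item,
        # Not "accepted": acceptance is the preparer's decision.
        "status": "received_pending_review",
        # boto3's DynamoDB serializer rejects float, so store a Decimal.
        "confidence": Decimal(str(confidence)),
        "recorded_at": stamp,
    }
```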

Two cases route to the preparer instead. If the model is unsure — a blurry photo, an unusual layout, a document it can’t confidently name — the upload lands in a small “needs filing” queue on the status board for a human to match by hand. And if the document clearly matches something that isn’t on this client’s checklist — the client uploaded a brokerage statement when their checklist has no investment item — it also goes to the preparer, who can add the item to the checklist or set the document aside. Both cases are normal, and both keep a person in the loop exactly where a machine would be guessing.
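Taken together, the three outcomes reduce to one small decision function. The sketch below is illustrative: the 0.85 threshold and the outcome labels are assumptions, as the post does not state what counts as "confident".

```python
from typing import List, Optional

CONFIDENT = 0.85  # assumption: the post does not give a threshold


def route(item: Optional[str], confidence: float, open_items: List[str],
          threshold: float = CONFIDENT) -> str:
    """Return 'checked_off', 'needs_filing', or 'off_list' for one upload."""
    # Written as "not >=" so a NaN confidence also falls through to a human.
    if item is None or not (confidence >= threshold):
        return "needs_filing"  # unsure: a person matches it by hand
    if item in open_items:
        return "checked_off"   # mark the item received, pending review
    return "off_list"          # confident, but not on this client's checklist


# Example: a clear W-2 on a checklist that is still waiting for one.
print(route("W-2", 0.95, ["W-2", "1098"]))  # checked_off
```

Only one branch ever checks an item off; every other path ends with the preparer.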

Why this shape

The expensive, risky work in document collection isn’t storing files. It’s the silent mistake: a document filed against the wrong client, or a number misread into a return. This design removes the second risk entirely by never reading figures, and it keeps the first risk small by checking the signed token, keying every file to a client, and sending anything uncertain to a human. The one model call is cheap, bounded, and easy to reason about: it answers one question, and when it is unsure it fails in the safe direction (it sends the document to a person rather than guessing). Even a confident match only marks an item received, pending review, so a wrong match is still caught when the preparer reviews the contents.

Next post: how the daily tick reads each file, sees what’s still missing, and decides whether to send a first request, a reminder, an escalation, or nothing at all.
