How a vendor document gets checked
A vendor drops a PDF on their upload page. Now the onboarder has to work out what it is, whether it fills a gap in the checklist, and whether it’s any good — a certificate of insurance that expired last month is present but useless. The reading uses a model; the deciding is plain Python; and a person confirms before the item turns green. This post follows one uploaded file from the moment it lands to the moment it counts.
Key takeaways
- An upload triggers the checker once — Textract reads the page, Haiku 4.5 reads the fields.
- Plain Python matches the file to a required item and checks that it is both present and in date.
- An expired insurance certificate is present but fails the date check — it stays not-done.
- A person confirms the read before the item counts; a misread date is caught here, not later.
- The checker calls a model only on upload — never on the daily chase tick.
The check flow, per uploaded file
Reading the file: a model, used narrowly
When a file lands in the vendor’s folder, the S3 write triggers the checker Lambda. It runs Amazon Textract on the file — Textract is a managed service that reads the text and tables out of PDFs and images, so a scanned certificate works as well as a clean PDF. The extracted text then goes to one Bedrock Haiku 4.5 call with a tight prompt: “Here is the text of a document a vendor uploaded. Tell me which of these it is — bank details, tax form, insurance certificate, signed agreement, or unknown — and pull these fields if present: the tax-ID, the policy expiry date, the bank account name. Return JSON only. Do not invent a date that isn’t in the text.”
That’s the only place a model is involved. It reads; it doesn’t decide anything. Everything after this is plain Python comparing what the model pulled against the checklist rules.
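To make the "reads, doesn't decide" boundary concrete, here is a minimal sketch of the two pure-Python pieces that sit on either side of the model call: building the prompt from the extracted text, and turning the reply into something safe to hand to the rules. The Textract and Bedrock calls themselves are left out. The names `build_prompt` and `parse_reply`, the type labels, and the JSON keys (`doc_type`, `fields`) are assumptions made for illustration, not the project's actual schema.

```python
import json

# Assumed labels and field names; the post lists them only in prose.
DOC_TYPES = ("bank_details", "tax_form", "insurance_certificate",
             "signed_agreement", "unknown")
FIELDS = ("tax_id", "policy_expiry_date", "bank_account_name")


def build_prompt(extracted_text: str) -> str:
    """Wrap the extracted document text in the tight classification prompt."""
    return (
        "Here is the text of a document a vendor uploaded. "
        "Tell me which of these it is: " + ", ".join(DOC_TYPES) + ". "
        "Pull these fields if present: " + ", ".join(FIELDS) + ". "
        "Return JSON only. Do not invent a date that isn't in the text.\n\n"
        "<document>\n" + extracted_text + "\n</document>"
    )


def parse_reply(raw: str) -> dict:
    """Parse the model's reply; anything malformed degrades to 'unknown'."""
    fallback = {"doc_type": "unknown", "fields": {}}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback

    doc_type = data.get("doc_type")
    if doc_type not in DOC_TYPES:
        doc_type = "unknown"

    raw_fields = data.get("fields")
    if not isinstance(raw_fields, dict):
        raw_fields = {}
    # Keep only the fields we asked for, and only if they are strings.
    fields = {k: raw_fields[k] for k in FIELDS
              if isinstance(raw_fields.get(k), str)}
    return {"doc_type": doc_type, "fields": fields}
```

Treating every unexpected reply as "unknown" means a bad model response ends up in front of a person instead of being guessed at.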
Two checks: present, and in date
A document can fail in two different ways, and the checker tests for both.
Present. Does this file actually fill a gap on the checklist? The Python maps the model’s proposed type to a required item for this vendor. If the vendor owes an insurance certificate and the upload reads as one, that item is now a candidate to mark present. If the file reads as something the vendor doesn’t owe — or as “unknown” — it’s held as unmatched for a person to look at, rather than silently dropped.
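The present check can be sketched in a few lines. This assumes the model's proposed type uses the same label as the checklist item, and that the items a vendor still owes arrive as a set; `match_upload` and the outcome labels are hypothetical names.

```python
def match_upload(doc_type: str, owed_items: set[str]) -> dict:
    """Map a proposed document type to an item this vendor owes.

    Returns a candidate match, or an 'unmatched' record that is held
    for a person to look at. Nothing is dropped silently.
    """
    if doc_type == "unknown":
        return {"outcome": "unmatched", "item": None,
                "reason": "type could not be determined"}
    if doc_type not in owed_items:
        return {"outcome": "unmatched", "item": None,
                "reason": "not owed by this vendor"}
    return {"outcome": "candidate", "item": doc_type}
```

For example, `match_upload("insurance_certificate", {"insurance_certificate", "tax_form"})` yields a candidate, while the same upload for a vendor who owes only a tax form is held as unmatched.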
In date. Some documents have an expiry; a certificate of insurance is the clearest case, and some tax forms are only valid for the current year. The checklist doc says which items carry a date rule. For those, the Python compares the expiry date the model pulled against today. An insurance certificate that expired last month is present but fails the date check, so the item stays not-done with a note: “Insurance certificate received, but it expired on 2026-04-30 — please upload a current one.” This is the check that catches the most common real-world problem: a supplier who genuinely sent a certificate, just last year’s.
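A rough sketch of the date check follows. It assumes the pulled expiry is an ISO date string, that the set of dated items (`DATED_ITEMS`) has been loaded from the checklist doc, and that a certificate is still valid on its expiry day. The function name and return shape are illustrative.

```python
from datetime import date
from typing import Optional

# Assumption: which items carry a date rule, as loaded from the checklist doc.
DATED_ITEMS = {"insurance_certificate"}


def check_in_date(item: str, expiry_iso: Optional[str],
                  today: date) -> tuple[bool, str]:
    """Return (in_date, note). The note is empty when the check passes."""
    if item not in DATED_ITEMS:
        return True, ""  # no date rule for this item

    label = item.replace("_", " ").capitalize()
    if not expiry_iso:
        return False, f"{label} received, but no expiry date could be read."
    try:
        expiry = date.fromisoformat(expiry_iso)
    except ValueError:
        return False, f"{label} received, but the expiry date is unreadable."

    # Assumption: the document is valid through the expiry day itself.
    if expiry < today:
        return False, (f"{label} received, but it expired on "
                       f"{expiry.isoformat()}. Please upload a current one.")
    return True, ""
```

A missing or unparseable date fails the check rather than passing it, so an item never turns green on a date nobody could read.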
Why a human confirms every read
Even a good model misreads a smudged date or a tax-ID with an extra digit now and then. For a vendor file, a wrong read is dangerous in a quiet way: it can mark an item done when it isn’t, and the system would then stop chasing it. So every read goes to a one-tap confirmation. The owner (or whoever set up the vendor) sees the proposed item, the pulled fields, and a thumbnail, with two buttons: confirm and fix. On confirm, the checklist row in DynamoDB marks the item present, in date, and confirmed. On fix, they correct the field — usually a date — and the corrected value is what’s stored.
The confirmation is deliberately lightweight, because most reads are right and you don’t want to make the common case slow. But the principle holds: a document counts as done only when a person has agreed it’s the right document and the read is correct. The cost of a wrong “done” here is a vendor approved on paperwork that wasn’t really valid.
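The confirm and fix actions could be handled roughly like this. It is a sketch under stated assumptions: the entry is a plain dict with the fields named in this series, an entry without an expiry belongs to an item with no date rule, and the `status` values `"done"` and `"not_done"` are made up for illustration. The write to DynamoDB is left out.

```python
from datetime import date, datetime
from typing import Optional


def apply_review(entry: dict, action: str, reviewer: str, now: datetime,
                 corrected_expiry: Optional[str] = None) -> dict:
    """Return a new entry reflecting a person's confirm or fix."""
    if action not in ("confirm", "fix"):
        raise ValueError(f"unknown action: {action!r}")

    updated = dict(entry)  # leave the caller's copy untouched
    if action == "fix":
        if corrected_expiry is None:
            raise ValueError("fix requires a corrected value")
        date.fromisoformat(corrected_expiry)  # raises ValueError if malformed
        updated["expiry"] = corrected_expiry  # the corrected value is stored

    # Re-run the date check on whatever expiry is now stored.
    expiry = updated.get("expiry")
    updated["in_date"] = (expiry is None
                          or date.fromisoformat(expiry) >= now.date())
    updated["present"] = True
    updated["confirmed_by"] = reviewer
    updated["confirmed_at"] = now.isoformat()
    updated["status"] = "done" if updated["in_date"] else "not_done"
    return updated
```

Note that confirming an expired certificate records the confirmation but still leaves the item not done: the person agrees with the read, and the date rule does the rest.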
State that makes the checklist trustworthy
The vendor’s checklist lives in one DynamoDB row per vendor, with a small map for each required item: (item, status, present, in_date, expiry, confirmed_by, confirmed_at). Every upload updates exactly one item’s entry. Because the state is explicit, the chase tick in the next post never has to re-read a document — it just looks at which items are still not done. Re-uploading a replacement (a current certificate over an expired one) overwrites that one item’s entry and leaves the rest alone.
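One way to overwrite a single item's entry while leaving the rest of the row alone is a nested-path update. The sketch below only builds the keyword arguments you might pass to boto3's `Table.update_item`; it makes no AWS call. The key name `vendor_id`, the map attribute name `items`, and the function name are assumptions, since the post does not give the table schema.

```python
def build_item_update(vendor_id: str, entry: dict) -> dict:
    """Build update_item kwargs that replace one item's entry in the row.

    Assumes the row already holds a map attribute called "items".
    Both path segments go through ExpressionAttributeNames so that
    neither can collide with a DynamoDB reserved word.
    """
    return {
        "Key": {"vendor_id": vendor_id},  # assumed partition key name
        "UpdateExpression": "SET #items.#item = :entry",
        "ExpressionAttributeNames": {
            "#items": "items",
            "#item": entry["item"],
        },
        "ExpressionAttributeValues": {":entry": entry},
    }


# A replacement certificate overwrites only the insurance entry.
replacement = {
    "item": "insurance_certificate",
    "status": "pending",
    "present": True,
    "in_date": True,
    "expiry": "2027-04-30",
    "confirmed_by": None,
    "confirmed_at": None,
}
kwargs = build_item_update("vendor-123", replacement)
```

With a boto3 table resource in hand, this would be applied as `table.update_item(**kwargs)`; the other items in the map are not part of the expression, so they are not touched.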
Why the daily tick uses no model
The checker calls Textract and Bedrock only when a file is uploaded — a handful of times per vendor, total. The daily chase tick in Part 4 reads the same DynamoDB row and calls no model at all; it just compares the checklist state against the calendar. That split keeps the cost down (you pay to read a document once, not every day) and keeps the part that runs every day completely predictable.
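To show how little the daily side needs, here is a sketch of working out what is still owed from the stored state alone. It assumes the row has already been read into a dict with an `items` map, and that the tick re-compares each stored expiry against today; both the shape and that re-check are assumptions for illustration. No Textract or Bedrock call appears anywhere in it.

```python
from datetime import date


def items_still_owed(row: dict, today: date) -> list[str]:
    """List the items that are not yet done, using stored state only."""
    owed = []
    for name, entry in row["items"].items():
        expiry = entry.get("expiry")
        # Assumption: a stored expiry is re-checked against the calendar.
        lapsed = expiry is not None and date.fromisoformat(expiry) < today
        done = bool(entry.get("present")) and bool(entry.get("confirmed_by")) \
            and not lapsed
        if not done:
            owed.append(name)
    return sorted(owed)
```

Everything here is a dictionary lookup and a date comparison, which is why the daily run costs almost nothing and behaves the same way every time.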
Next post: how a vendor gets chased for the items still missing — the daily tick, the four moves, and how a reminder lists only what’s left.