Part 2 of 7 · Supplier bill matcher series ~4 min read

How a supplier bill gets read

The matcher can only check what it can read. So the first job is turning a supplier bill — which arrives as a PDF, a portal entry, or a scan somebody dropped in a folder — into clean, structured lines: this item, this many units, this unit price. There are three ways a bill gets in, and they all end the same way: Textract reads the page, and one short Bedrock call tidies the read into lines the matcher can compare. No bill skips the read step, because a bill the matcher can’t read is a bill that can’t be checked.

Key takeaways

  • Three intake lanes feed one reader: an emailed-PDF lane, a supplier-portal poll, and a manual upload.
  • Every bill is read by Textract; Bedrock Haiku 4.5 turns the read text into clean structured lines.
  • The reader extracts the supplier, bill number, PO reference, and each line’s item, quantity, and price.
  • Low-confidence reads are held for a human to confirm before the bill goes to the matcher.
  • The read step is where AI earns its place — the match itself, in Part 3, uses no model.

Three lanes into one reader

Three intake lanes funnel into one reader A diagram with three vertical lane columns at the top and a single unified row at the bottom. Lane one, Emailed PDF: somebody forwards a supplier bill PDF to a dedicated address, bills-at-your-company; SES writes the raw email to S3, and an S3 event starts the read. Lane two, Supplier portal: a portal-poll Lambda runs on a schedule, signs in to supplier portals that publish bills, and downloads any new bill PDFs to S3, where the same read starts. Lane three, Manual upload: somebody drops a scanned or downloaded bill straight into an upload folder in S3; the S3 event starts the read just like the other lanes. All three lanes converge on the same reader: Textract reads the page into text and tables, then one Bedrock Haiku 4.5 call turns the read into clean structured lines — supplier, bill number, PO reference, and per line the item, quantity, and unit price — and writes them to the bm-bills table for the matcher. A note at the bottom: every bill is read the same way no matter how it arrived — and a low-confidence read is held for a human to confirm before it reaches the matcher. Lane 1 · SES Emailed PDF • Forward bill to bills-address • SES writes email to S3 • S3 event starts the read • Most common lane for SMBs Lane 2 · scheduled poll Supplier portal • portal-poll Lambda runs on a schedule • Signs in, pulls any new bill PDFs • Writes them to S3; same read starts • No forwarding needed Lane 3 · drop a file Manual upload • Drop a scan or PDF in the upload folder • S3 event starts the read • Same reader as Lanes 1 and 2 • For paper bills and one-offs Reader: Textract reads, Bedrock cleans into lines supplier · bill number · PO reference · per line: item · quantity · unit price clean lines written to the bm-bills table — matcher reads from there to matcher, next post Every bill is read the same way — and a low-confidence read is held for a human to confirm first.
Fig 2. Three lanes converge on one reader. However a bill arrives, it is read by Textract and cleaned into structured lines by a single Bedrock call. The clean lines land in the bm-bills table, which the matcher reads in the next post.

Lane 1: emailed PDF (the lane most teams actually use)

Set up a dedicated inbound address — something like bills@your-company.com — via Amazon SES. Suppliers email their bills there directly, or anyone on the team forwards a bill they received. SES writes the raw email to s3://bm-raw-mime/. The S3 PUT triggers a reader Lambda that walks the email to the PDF attachment and starts the read.

This is the lane most small businesses live in. Suppliers already email bills; pointing those emails at one address is the whole setup. If a bill arrives as a link instead of an attachment, the reader follows the link, downloads the PDF, and proceeds the same way.

Lane 2: supplier portal

Some suppliers don’t email bills at all — they post them to a portal and expect you to log in and fetch them. A small portal-poll Lambda runs on a schedule (a few times a day), signs in to each configured portal using credentials stored in Secrets Manager, and downloads any new bill PDFs to s3://bm-uploads/. From there the same S3 event starts the same read. The team never has to remember to log in and check; the poll does it.

Portals change their sign-in flow now and then, so this lane is the most maintenance-prone of the three. It’s worth it only for suppliers that won’t email — for everyone else, Lane 1 is simpler and more reliable.

Lane 3: manual upload

Paper still happens. A bill arrives in the post, or as a photo in a text message, or as a one-off a supplier handed over at delivery. For those, someone scans or saves the file and drops it straight into an upload folder that maps to s3://bm-uploads/. The S3 event starts the read exactly as the other lanes do. This lane is the catch-all: anything that didn’t come by email or portal still gets read and checked the same way.

The read step: Textract, then one Bedrock call

Whatever lane a bill came in through, the reader does the same two things. First, Amazon Textract reads the PDF — it handles PDF, PNG, JPEG, and TIFF, including scans and photos, and returns the text plus any tables it found on the page. Bills are table-heavy (a list of line items with quantities and prices), so Textract’s table reading does most of the heavy lifting here.

Second, one Bedrock Haiku 4.5 call turns that raw read into clean, structured lines. Suppliers lay bills out a hundred different ways — the quantity column might be labelled “Qty” or “Units” or nothing at all; the unit price might be before tax or after; a single line might wrap across two rows. The model’s job is narrow and well-defined: take the messy table and emit a tidy list — supplier, bill number, PO reference if present, and per line the item code, item name, quantity, and unit price. The prompt is short: “Return JSON only. Copy numbers exactly as printed. Mark each field with a confidence score. Do not invent a value that isn’t on the page.”

This is the one place in the whole system where a model earns its place. Reading wildly varied real-world bill layouts is exactly what a model is good at, and exactly what a pile of brittle parsing rules is bad at. But the model only reads — it never decides whether the bill is correct. That decision, covered in Part 3, is plain Python.

Low-confidence reads are held, not guessed

Each field the reader emits carries a confidence score. If a key field — the bill total, a line quantity, a unit price — comes back low-confidence (a smudged scan, an unusual layout), the bill is held in a needs-review state and a person is asked to confirm the read before it goes to the matcher. The reason is the same one that runs through this whole system: a bill the matcher checked against a misread quantity is worse than a bill that waited two minutes for a human to glance at it. The misread one passes a check it should have failed.

Next post: how the matcher takes those clean lines, lines them up against the purchase order and the goods-received note, and picks one of four outcomes — with no model anywhere in the decision.

All posts