How a receipt gets read

Key takeaways

Amazon Textract reads the receipt and returns vendor, date, total, and tax with a confidence score each.
The confidence threshold lives in the rules doc — below it, a human confirms before anything files.
Every receipt ends in exactly one of four results: filed, needs-review, duplicate, or rejected.
A duplicate check on vendor, date, and total stops the same purchase being claimed twice.
The reader never invents a number it couldn’t read — an unclear total goes to a human, not a guess.

The decision flow, per receipt

Fig 3. The reader’s decision tree, per receipt. Five steps decide which of four results applies. Textract reads the fields; the rules doc holds the threshold; the reader only enforces it.

What Textract reads, and how sure it is

Textract has a feature built specifically for receipts and invoices — AnalyzeExpense. You hand it the image, and it hands back named fields: vendor name, transaction date, the total, the tax amount, and often the individual line items. The useful part is that each field comes with a confidence score from 0 to 100 — how sure Textract is that it read that value correctly. A crisp digital PDF scores in the high 90s on every field. A crumpled photo of faded thermal paper might read the vendor fine but score the total at 71 because one digit is smudged.
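As a rough sketch of what that response looks like in practice, the snippet below pulls the four summary fields out of an AnalyzeExpense response. The `ExpenseDocuments`, `SummaryFields`, `Type`, and `ValueDetection` keys follow Textract's documented response shape. The `read_fields` helper, the `WANTED` mapping, and the hand-written sample are illustrative assumptions, not the reader's actual code.

```python
from typing import NamedTuple


class Field(NamedTuple):
    text: str
    confidence: float  # Textract's 0-100 score for the value it read


# Textract field type -> the name used in this post (this mapping is an assumption).
WANTED = {
    "VENDOR_NAME": "vendor",
    "INVOICE_RECEIPT_DATE": "date",
    "TOTAL": "total",
    "TAX": "tax",
}


def read_fields(response: dict) -> dict[str, Field]:
    """Keep the highest-confidence value for each wanted summary field."""
    fields: dict[str, Field] = {}
    for doc in response.get("ExpenseDocuments", []):
        for item in doc.get("SummaryFields", []):
            name = WANTED.get(item.get("Type", {}).get("Text"))
            value = item.get("ValueDetection", {})
            if name is None or not value.get("Text"):
                continue  # not a field we track, or nothing was read
            candidate = Field(value["Text"], value.get("Confidence", 0.0))
            if name not in fields or candidate.confidence > fields[name].confidence:
                fields[name] = candidate
    return fields


# Hand-written sample in the shape returned by boto3's
# textract.analyze_expense(Document={"Bytes": image_bytes}); values are made up.
sample = {
    "ExpenseDocuments": [{
        "SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME"},
             "ValueDetection": {"Text": "Corner Cafe", "Confidence": 98.2}},
            {"Type": {"Text": "INVOICE_RECEIPT_DATE"},
             "ValueDetection": {"Text": "03/14/2024", "Confidence": 96.5}},
            {"Type": {"Text": "TOTAL"},
             "ValueDetection": {"Text": "$83.40", "Confidence": 71.0}},
            {"Type": {"Text": "TAX"},
             "ValueDetection": {"Text": "$6.18", "Confidence": 97.4}},
        ]
    }]
}

for name, field in read_fields(sample).items():
    print(f"{name}: {field.text} ({field.confidence:.0f})")
```

Keeping the text and its score together in one small record means every later step can ask "how sure?" without going back to the raw response.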

That score is the whole game. The reader doesn’t treat “$83.40 at 71% sure” and “$83.40 at 99% sure” the same way. The first one is a question for a human; the second one files itself. The rules doc holds the threshold — the line between “file it” and “ask a human” — with a sensible default of 90 for the money fields (total and tax) and a slightly lower bar for vendor name, where a small misread matters less.
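One way to picture the threshold check is a small lookup plus a comparison. In this sketch the 90 for total and tax is the default named above. The 85 for vendor and date is a placeholder, since the rules doc only says "slightly lower" for vendor and this post gives no number for date. `THRESHOLDS` and `fields_needing_review` are hypothetical names.

```python
# Thresholds as they might look once loaded from the rules doc.
# 90 for the money fields is the stated default; 85 is an assumed placeholder.
THRESHOLDS = {"total": 90.0, "tax": 90.0, "vendor": 85.0, "date": 85.0}


def fields_needing_review(
    confidences: dict[str, float],
    thresholds: dict[str, float] = THRESHOLDS,
) -> list[str]:
    """Return the fields that are missing or scored below their threshold."""
    return [
        name
        for name, bar in thresholds.items()
        if confidences.get(name, 0.0) < bar  # a missing field counts as 0
    ]


smudged = {"vendor": 98.0, "date": 96.0, "total": 71.0, "tax": 95.0}
crisp = {"vendor": 99.0, "date": 99.0, "total": 99.0, "tax": 99.0}

print(fields_needing_review(smudged))  # ['total']
print(fields_needing_review(crisp))    # []
```

Because the numbers live in data rather than in the comparison itself, changing the bar means editing the rules doc, not the reader.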

Four results, always

Every receipt, once read, lands in exactly one of four buckets. The names are plain on purpose.

Filed. Every field read cleanly, the total and tax are above the money threshold, the math is sane (tax isn’t larger than the total), and it’s not a duplicate. The record is written straight to the expense sheet. Most receipts, most days, file themselves.
Needs-review. At least one field came back below the threshold — a blurry total, a date that didn’t parse, a vendor Textract couldn’t read. The receipt goes to the bookkeeper’s queue with the image and the fields it could read, so a human can confirm the one unclear value in a few seconds.
Duplicate. A receipt with the same vendor, date, and total already exists. This catches the common case where someone forwards the email and snaps the paper copy, or uploads the same batch twice. It’s flagged, not filed, so the same lunch never gets claimed twice.
Rejected. Textract found no money fields and no vendor — it’s not a receipt. Somebody forwarded a newsletter, a calendar invite, or a blank photo. It’s set aside in a rejected folder with a one-line note, so nothing clutters the books and nothing is silently lost either.
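The four buckets above can be sketched as a single function that returns exactly one result. The order of the checks, and the choice to send a missing field or a failed sanity check (tax larger than total) to review, are assumptions made for illustration. The `Receipt` type and its flags are hypothetical and stand in for the outcomes of the confidence and duplicate checks.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Optional


class Result(Enum):
    FILED = "filed"
    NEEDS_REVIEW = "needs-review"
    DUPLICATE = "duplicate"
    REJECTED = "rejected"


@dataclass
class Receipt:
    vendor: Optional[str]
    date: Optional[str]
    total: Optional[Decimal]
    tax: Optional[Decimal]
    low_confidence: bool  # True if any field scored below its threshold
    is_duplicate: bool    # outcome of the vendor/date/total duplicate check


def classify(r: Receipt) -> Result:
    # No money fields and no vendor: not a receipt at all.
    if r.vendor is None and r.total is None and r.tax is None:
        return Result.REJECTED
    # Any missing or shaky field is a question for a human.
    if r.low_confidence or any(v is None for v in (r.vendor, r.date, r.total, r.tax)):
        return Result.NEEDS_REVIEW
    # Sanity check: tax should not exceed the total (routing to review is assumed).
    if r.tax > r.total:
        return Result.NEEDS_REVIEW
    if r.is_duplicate:
        return Result.DUPLICATE
    return Result.FILED


clean = Receipt("Corner Cafe", "2024-03-14", Decimal("83.40"), Decimal("6.18"), False, False)
print(classify(clean).value)  # filed
```

Every path through the function ends in a `return`, so a receipt cannot fall between buckets or land in two of them.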

The duplicate check, in plain terms

Before a clean receipt files, the reader takes one more look: it searches the recent records for the same vendor, the same date, and the same total. If it finds one, the new receipt is marked a duplicate. The check is deliberately strict: all three have to match, so two different purchases at the same café on the same day never collide. The goal isn’t to block real expenses. Even an exact match can be genuine (two $12.00 coffees from the same café on the same day are possible), which is why a duplicate is flagged, not deleted. When in doubt, a near-match is sent to review rather than auto-rejected, so a human makes the final call on the borderline ones.
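A minimal sketch of the exact-match part of this check, assuming recent records are available as a set of (vendor, date, total) keys. `Key`, `make_key`, and `is_duplicate` are hypothetical names, and normalising vendor case and whitespace is an assumption rather than something the rules doc specifies.

```python
from datetime import date
from decimal import Decimal
from typing import NamedTuple


class Key(NamedTuple):
    vendor: str
    day: date
    total: Decimal


def make_key(vendor: str, day: date, total: Decimal) -> Key:
    # Lower-case and collapse whitespace so "Corner  CAFE" matches "Corner Cafe".
    return Key(" ".join(vendor.lower().split()), day, total)


def is_duplicate(candidate: Key, recent: set[Key]) -> bool:
    """True only when vendor, date, and total all match an existing record."""
    return candidate in recent


recent = {make_key("Corner Cafe", date(2024, 3, 14), Decimal("12.00"))}

same = make_key("  corner  CAFE ", date(2024, 3, 14), Decimal("12.0"))
other_total = make_key("Corner Cafe", date(2024, 3, 14), Decimal("12.50"))

print(is_duplicate(same, recent))         # True
print(is_duplicate(other_total, recent))  # False
```

Totals are held as `Decimal` rather than `float` so that 12.0 and 12.00 compare (and hash) as equal without rounding surprises. The near-match path is left out here because the post does not define what counts as "near".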

Why the reader follows rules instead of guessing

The reader could let the AI model decide everything — read the receipt, pick the numbers, file it. It doesn’t. Two reasons. First, the money fields are the one place a wrong value does real damage: a total off by a digit flows into the books, the tax return, and an auditor’s spreadsheet. So those fields are gated on Textract’s own confidence score, and anything shaky goes to a human. Second, the read itself should be predictable — the same image always produces the same fields and the same result, so a bookkeeper can trust what they’re seeing.

The AI model does play a part — but in the next step, not this one. Reading the fields is Textract’s job. Deciding which category a clean receipt belongs to is where Bedrock comes in. That’s the next post.
