Part 3 of 7 · Receipt organizer series ~5 min read

How a receipt gets read

A receipt comes off the queue. The reader Lambda runs Amazon Textract on the image, gets back the vendor, date, total, and tax — each with a score for how sure Textract is — and then has to decide what to do with it. Most receipts are clean and file themselves. Some are blurry, some are duplicates, some aren’t even receipts. The whole decision is plain rules over Textract’s confidence scores. No guessing on the numbers that matter.

Key takeaways

  • Amazon Textract reads the receipt and returns vendor, date, total, and tax with a confidence score each.
  • The confidence threshold lives in the rules doc — below it, a human confirms before anything files.
  • Every receipt ends in exactly one of four results: filed, needs-review, duplicate, or rejected.
  • A duplicate check on vendor, date, and total stops the same purchase being claimed twice.
  • The reader never invents a number it couldn’t read — an unclear total goes to a human, not a guess.

The decision flow, per receipt

Diagram: decision flow per receipt, as it is read. The flow runs top to bottom. It starts with an input box, "Receipt from queue," carrying the receipt id, the stored image, the source, and the submitter. Step 1, "Run Textract on the image": Textract AnalyzeExpense returns vendor, date, total, and tax, each with a confidence score. Step 2, "Is this a receipt at all?": if Textract found no money fields and no vendor, route to "Rejected" (somebody forwarded a newsletter); otherwise continue. Step 3, "Check every field's confidence": compare each field's score against the threshold in the rules doc. Step 4, "Look for a duplicate": search recent receipts for the same vendor, date, and total; on a match, route to "Duplicate" and flag it so the same purchase isn't claimed twice. Step 5, with no match, look at the confidence result: if every field is above the threshold and the math checks out, route to "Filed," a clean record; if any field is below the threshold, route to "Needs-review" and send the image and the read fields to the bookkeeper's queue. Each terminal box (Rejected, Duplicate, Filed, Needs-review) writes the result to the ro-receipts table with the read fields and the scores. A note at the bottom: the threshold lives in the rules doc; raise it and more receipts go to a human, lower it and more file on their own.
Fig 3. The reader’s decision tree, per receipt. Five steps decide which of four results applies. Textract reads the fields; the rules doc holds the threshold; the reader only enforces it.

What Textract reads, and how sure it is

Textract has a feature built specifically for receipts and invoices — AnalyzeExpense. You hand it the image, and it hands back named fields: vendor name, transaction date, the total, the tax amount, and often the individual line items. The useful part is that each field comes with a confidence score from 0 to 100 — how sure Textract is that it read that value correctly. A crisp digital PDF scores in the high 90s on every field. A crumpled photo of faded thermal paper might read the vendor fine but score the total at 71 because one digit is smudged.
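To make the shape of that answer concrete, here is a minimal sketch of pulling the four fields and their scores out of an AnalyzeExpense response. It assumes the response is the dictionary boto3's Textract client returns from `analyze_expense`, and that the normalized field types are `VENDOR_NAME`, `INVOICE_RECEIPT_DATE`, `TOTAL`, and `TAX`. The function name `read_fields`, the short keys, the choice to keep the highest-scoring read when a type repeats, and the sample data are all this sketch's own assumptions, not the series' actual code.

```python
# Map Textract's normalized field types to the short names used in this sketch.
WANTED = {
    "VENDOR_NAME": "vendor",
    "INVOICE_RECEIPT_DATE": "date",
    "TOTAL": "total",
    "TAX": "tax",
}


def read_fields(response):
    """Return {name: (text, confidence)} for the four fields the reader uses.

    Assumption: if a field type shows up more than once, keep the read with
    the highest confidence on its value.
    """
    fields = {}
    for doc in response.get("ExpenseDocuments", []):
        for item in doc.get("SummaryFields", []):
            name = WANTED.get(item.get("Type", {}).get("Text"))
            if name is None:
                continue  # a field type this sketch does not use
            value = item.get("ValueDetection", {})
            text = value.get("Text", "").strip()
            confidence = float(value.get("Confidence", 0.0))
            if name not in fields or confidence > fields[name][1]:
                fields[name] = (text, confidence)
    return fields


def _field(type_text, value_text, confidence):
    """Build one summary field in the response shape (hand-written test data)."""
    return {
        "Type": {"Text": type_text},
        "ValueDetection": {"Text": value_text, "Confidence": confidence},
    }


# Hand-written sample, not real Textract output: a smudged total at 71.
sample = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                _field("VENDOR_NAME", "Corner Cafe", 98.2),
                _field("INVOICE_RECEIPT_DATE", "2024-03-14", 96.5),
                _field("TOTAL", "83.40", 71.0),
                _field("TAX", "6.18", 93.4),
            ]
        }
    ]
}

print(read_fields(sample))
```

The sketch reads the score attached to the value itself, since that is the number the reader cares about: how sure Textract is about what it read, not about which label it chose.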

That score is the whole game. The reader doesn’t treat “$83.40 at 71% sure” and “$83.40 at 99% sure” the same way. The first one is a question for a human; the second one files itself. The rules doc holds the threshold — the line between “file it” and “ask a human” — with a sensible default of 90 for the money fields (total and tax) and a slightly lower bar for vendor name, where a small misread matters less.
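As a sketch of that comparison: the thresholds below stand in for whatever the rules doc holds. The 90 for total and tax is the default named above. The 80 for vendor and the 90 for date are assumptions made up for illustration, as are the function name and the choice to treat a missing field the same as a low score.

```python
# Stand-in for the rules doc. Only the 90 for total and tax comes from the
# text; the vendor and date values are assumptions.
THRESHOLDS = {"total": 90.0, "tax": 90.0, "date": 90.0, "vendor": 80.0}


def fields_below_threshold(fields, thresholds=THRESHOLDS):
    """Return the sorted names of fields that are missing or below their bar.

    A score exactly on the threshold passes here; the real rule may differ.
    """
    shaky = []
    for name, bar in thresholds.items():
        read = fields.get(name)  # (text, confidence) or None if never read
        if read is None or read[1] < bar:
            shaky.append(name)
    return sorted(shaky)


smudged = {
    "vendor": ("Corner Cafe", 98.2),
    "date": ("2024-03-14", 96.5),
    "total": ("83.40", 71.0),
    "tax": ("6.18", 93.4),
}
crisp = {**smudged, "total": ("83.40", 99.0)}

print(fields_below_threshold(smudged))  # ['total']
print(fields_below_threshold(crisp))    # []
```

Same $83.40 both times. Only the score changes, and only the score decides who looks at it next.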

Four results, always

Every receipt, once read, lands in exactly one of four buckets. The names are plain on purpose.

  • Filed. Every field read cleanly, the total and tax are above the money threshold, the math is sane (tax isn’t larger than the total), and it’s not a duplicate. The record is written straight to the expense sheet. Most receipts, most days, file themselves.
  • Needs-review. At least one field came back below the threshold — a blurry total, a date that didn’t parse, a vendor Textract couldn’t read. The receipt goes to the bookkeeper’s queue with the image and the fields it could read, so a human can confirm the one unclear value in a few seconds.
  • Duplicate. A receipt with the same vendor, date, and total already exists. This catches the common case where someone forwards the email and snaps the paper copy, or uploads the same batch twice. It’s flagged, not filed, so the same lunch never gets claimed twice.
  • Rejected. Textract found no money fields and no vendor — it’s not a receipt. Somebody forwarded a newsletter, a calendar invite, or a blank photo. It’s set aside in a rejected folder with a one-line note, so nothing clutters the books and nothing is silently lost either.
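The four buckets can be sketched as one small function. This is an illustration under assumptions, not the reader's actual code: the names `Result`, `classify`, and `parse_amount` are hypothetical, the duplicate lookup is passed in as a plain boolean, the vendor and date thresholds are made up, and a receipt whose math fails is sent to needs-review because the text does not say where it goes.

```python
from decimal import Decimal, InvalidOperation
from enum import Enum


class Result(Enum):
    FILED = "filed"
    NEEDS_REVIEW = "needs-review"
    DUPLICATE = "duplicate"
    REJECTED = "rejected"


def parse_amount(text):
    """Parse '$83.40' or '1,083.40' into a Decimal; None if it does not parse."""
    try:
        return Decimal(text.replace("$", "").replace(",", "").strip())
    except InvalidOperation:
        return None


def classify(fields, is_duplicate, thresholds):
    """Return exactly one Result for a receipt.

    fields maps a name to (text, confidence); thresholds maps a name to its bar.
    """
    # Not a receipt: no money fields and no vendor.
    if not any(name in fields for name in ("vendor", "total", "tax")):
        return Result.REJECTED
    # Same vendor, date, and total already on record.
    if is_duplicate:
        return Result.DUPLICATE
    # Any field missing or below its bar goes to a human.
    for name, bar in thresholds.items():
        if name not in fields or fields[name][1] < bar:
            return Result.NEEDS_REVIEW
    # Math sanity: both amounts parse and tax is not larger than the total.
    total = parse_amount(fields.get("total", ("", 0.0))[0])
    tax = parse_amount(fields.get("tax", ("", 0.0))[0])
    if total is None or tax is None or tax > total:
        return Result.NEEDS_REVIEW  # assumption: bad math is a human's call
    return Result.FILED


# Assumed thresholds: 90 for money fields is from the text, the rest is made up.
THRESHOLDS = {"total": 90.0, "tax": 90.0, "date": 90.0, "vendor": 80.0}
clean = {
    "vendor": ("Corner Cafe", 98.2),
    "date": ("2024-03-14", 96.5),
    "total": ("$83.40", 99.0),
    "tax": ("$6.18", 97.1),
}

print(classify(clean, False, THRESHOLDS).value)
print(classify({**clean, "total": ("$83.40", 71.0)}, False, THRESHOLDS).value)
```

The checks run in a fixed order and each one returns, so a receipt can never land in two buckets. Note that nothing in the function ever changes a value it was given: an unclear total is routed, never repaired.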

The duplicate check, in plain terms

Before a clean receipt files, the reader does one more look: it searches the recent records for the same vendor, the same date, and the same total. If it finds one, the new receipt is marked a duplicate. The check is deliberately strict: all three have to match, because the goal isn’t to block real expenses. Even an exact match can be genuine (two $12.00 coffees from the same café on the same day are possible), so a match is flagged rather than discarded. When in doubt, a near-match is sent to review rather than auto-rejected, so a human makes the final call on the borderline ones.
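Here is a minimal sketch of the strict three-way match. It assumes each record is a dictionary with `id`, `vendor`, `date`, and `total` keys, that dates are already normalized to the same string format, and that vendor names are compared ignoring case and surrounding spaces. Those details, and the function names, are assumptions for illustration. How "recent" is defined and how the records are fetched are left out.

```python
from decimal import Decimal


def same_purchase(a, b):
    """True only when vendor, date, and total all match."""
    return (
        a["vendor"].strip().casefold() == b["vendor"].strip().casefold()
        and a["date"] == b["date"]
        # Decimal comparison so "12" and "12.00" count as the same total.
        and Decimal(a["total"]) == Decimal(b["total"])
    )


def find_duplicate(new, recent):
    """Return the id of the first matching recent record, or None."""
    for record in recent:
        if same_purchase(new, record):
            return record["id"]
    return None


# Hand-written sample records, not real data.
recent = [
    {"id": "r-001", "vendor": "Corner Cafe", "date": "2024-03-14", "total": "12.00"},
    {"id": "r-002", "vendor": "Office Depot", "date": "2024-03-14", "total": "48.90"},
]

forwarded_copy = {"vendor": " corner cafe ", "date": "2024-03-14", "total": "12"}
next_day = {"vendor": "Corner Cafe", "date": "2024-03-15", "total": "12.00"}

print(find_duplicate(forwarded_copy, recent))  # r-001
print(find_duplicate(next_day, recent))        # None
```

The sketch covers only the exact match. What counts as a near-match is not spelled out here, so that branch is deliberately left out rather than guessed at.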

Why the reader follows rules instead of guessing

The reader could let the AI model decide everything — read the receipt, pick the numbers, file it. It doesn’t. Two reasons. First, the money fields are the one place a wrong value does real damage: a total off by a digit flows into the books, the tax return, and an auditor’s spreadsheet. So those fields are gated on Textract’s own confidence score, and anything shaky goes to a human. Second, the read itself should be predictable — the same image always produces the same fields and the same result, so a bookkeeper can trust what they’re seeing.

The AI model does play a part — but in the next step, not this one. Reading the fields is Textract’s job. Deciding which category a clean receipt belongs to is where Bedrock comes in. That’s the next post.
