Part 3 of 7 · Document pipeline series · ~5 min read

How the AI reads a document

Reading a document is two jobs in one. First, find the words and the layout. Second, understand what they mean for this kind of document. The pipeline uses two AIs, one specialist and one generalist, and gives each of them one of those jobs.

[Figure: a vertical pipeline with four boxes. Raw document (cleared intake, ready to read) → The specialist (finds words, tables, signatures, layout; output: raw text + layout coordinates) → The generalist (turns layout into named fields per document type; output: named fields + confidence, as JSON) → Structured fields (each with a confidence score for the validator). Bottom note: two AIs, one job each. The combo is more accurate than either alone.]
Fig 3. Layout specialist first, meaning generalist second. Output: structured fields the validator can score.

Two AIs, one job each

You could in theory hand the whole document to a single big AI and ask “please extract the fields.” It works — sometimes. It also costs more, makes more mistakes on tables and signatures, and gets worse the longer the document gets.

The pipeline splits the work in two. A small specialist does the boring half (where are the words?). A small generalist does the interesting half (what do they mean?). Each tool gets used for what it’s good at.

The specialist: layout

The first AI is built specifically for documents. It finds words on the page and tells you exactly where they are. It recognises tables and gives you each cell. It detects signatures, checkmarks, and form fields. It does this consistently — the same scan twice gives you the same result.

What it does not do: decide which words are the “invoice number” and which are the “customer reference.” That’s not its job. It hands the next stage a clean map of the page.
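To make "a clean map of the page" a little more concrete, here is a minimal sketch of what such a map could look like as data. The class names (`Word`, `Table`, `LayoutMap`) and the box format are assumptions made up for illustration, not the actual output of the specialist used in this pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    page: int
    # Hypothetical box format: (left, top, right, bottom) in page units.
    box: tuple[float, float, float, float]


@dataclass
class Table:
    page: int
    rows: list[list[str]]  # rows[r][c] is the text of one cell


@dataclass
class LayoutMap:
    words: list[Word] = field(default_factory=list)
    tables: list[Table] = field(default_factory=list)
    signature_pages: list[int] = field(default_factory=list)

    def text(self) -> str:
        """Join words by page, then top edge, then left edge.

        This is a naive reading order; real pages with columns need more care.
        """
        ordered = sorted(self.words, key=lambda w: (w.page, w.box[1], w.box[0]))
        return " ".join(w.text for w in ordered)


# Words arrive in arbitrary order; their positions put them back in sequence.
layout = LayoutMap(
    words=[
        Word("INV-1042", page=1, box=(120.0, 40.0, 180.0, 52.0)),
        Word("Invoice", page=1, box=(40.0, 40.0, 90.0, 52.0)),
        Word("No.", page=1, box=(95.0, 40.0, 115.0, 52.0)),
    ],
    tables=[Table(page=1, rows=[["Item", "Amount"], ["Widgets", "120.00"]])],
)
print(layout.text())
```

Note that nothing in this map says which word is the invoice number. It only records what is on the page and where.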

The generalist: meaning

The second AI is the kind that can be told what to look for in plain language. The pipeline tells it: “this is an invoice. Look for these fields: vendor name, invoice number, total amount, due date, line items.” The AI reads the layout from stage one and fills out the form.

Crucially, the generalist isn’t scanning the original document — it’s working from the specialist’s clean map. Faster, cheaper, more accurate.
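As a rough sketch of that hand-off, the instruction for the generalist can be assembled from the document type, the field list, and the specialist's text. Everything here is an assumption for illustration: the function names `build_instruction` and `parse_reply`, the wording of the instruction, and the idea that the generalist answers in JSON. The call to the AI itself is left out and replaced with a canned reply.

```python
import json

# Field list taken from the invoice example above.
INVOICE_FIELDS = ["vendor name", "invoice number", "total amount", "due date", "line items"]


def build_instruction(doc_type: str, fields: list[str], layout_text: str) -> str:
    """Assemble the plain-language request (hypothetical wording)."""
    field_list = ", ".join(fields)
    return (
        f"Document type: {doc_type}. Look for these fields: {field_list}.\n"
        "Reply with a JSON object keyed by field name.\n"
        "--- layout from the specialist ---\n"
        f"{layout_text}"
    )


def parse_reply(reply: str, fields: list[str]) -> dict:
    """Keep only the fields that were asked for; missing ones become None."""
    data = json.loads(reply)
    return {name: data.get(name) for name in fields}


layout_text = "Acme Ltd | Invoice No. INV-1042 | Total 120.00 | Due 2024-07-31"
instruction = build_instruction("invoice", INVOICE_FIELDS, layout_text)

# Stand-in for what a generalist might send back; not real model output.
canned_reply = (
    '{"vendor name": "Acme Ltd", "invoice number": "INV-1042", '
    '"total amount": "120.00", "extra": "ignored"}'
)
result = parse_reply(canned_reply, INVOICE_FIELDS)
print(result)
```

The generalist only ever sees `layout_text`, never the scan itself, which is the point of the split.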

Why two, not one

Three reasons:

  • Accuracy. Specialist tools beat generalist ones at the parts they specialise in. Asking a generalist to find table cells inside a scanned PDF is a recipe for hallucinated numbers.
  • Cost. The specialist is cheap per page and predictable. The generalist is also cheap, but only when it’s working on a small clean map — not a giant blob of raw scan.
  • Trust. When something goes wrong, you can tell which AI got it wrong. The split makes the system debuggable.
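The debuggability point can be sketched in a few lines: if the pipeline keeps what each stage produced, a wrong field can be traced back to one stage or the other. The names `Trace`, `run_pipeline`, and `blame` are hypothetical, the two stages are stubbed with fake functions, and the substring check is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    layout: str   # what the specialist produced
    fields: dict  # what the generalist produced from that layout


def run_pipeline(
    raw: bytes,
    specialist: Callable[[bytes], str],
    generalist: Callable[[str], dict],
) -> Trace:
    """Run both stages and keep the intermediate result for later inspection."""
    layout = specialist(raw)
    fields = generalist(layout)
    return Trace(layout=layout, fields=fields)


def blame(trace: Trace, field_name: str, expected: str) -> str:
    """Say which stage to look at when a field does not match a known-good value."""
    if trace.fields.get(field_name) == expected:
        return "ok"
    if expected not in trace.layout:
        return "specialist"  # the text never made it onto the map
    return "generalist"      # the text was on the map but was misread or mislabelled


# Stub stages standing in for the two AIs.
def fake_specialist(raw: bytes) -> str:
    return "Invoice No. INV-1042 Total 120.00"


def fake_generalist(layout: str) -> dict:
    return {"invoice number": "INV-1042", "total": "12.00"}


trace = run_pipeline(b"...", fake_specialist, fake_generalist)
print(blame(trace, "total", "120.00"))
```

With a single AI doing both jobs there is no intermediate `layout` to inspect, so this kind of check has nothing to compare against.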

What comes out the other end

For each field the generalist extracts, you get three things:

  • The field name (vendor, total, due date).
  • The value, in the right shape (a number for “total”, a date for “due date”).
  • A confidence score — how sure the AI is.
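Those three things map naturally onto a small record type. The sketch below is illustrative only: the `ExtractedField` class, the `coerce` helper, the 0-to-1 confidence range, and the sample values are all assumptions, not the pipeline's real schema.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Union


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: Union[str, Decimal, date]
    confidence: float  # assumed here to range from 0.0 to 1.0


def coerce(name: str, raw: str, confidence: float) -> ExtractedField:
    """Put the raw text into the right shape for its field."""
    if name == "total":
        value = Decimal(raw.replace(",", ""))  # a number for "total"
    elif name == "due date":
        value = date.fromisoformat(raw)        # a date for "due date"
    else:
        value = raw                            # everything else stays text
    return ExtractedField(name, value, confidence)


# Made-up sample values for illustration.
fields = [
    coerce("vendor", "Acme Ltd", 0.98),
    coerce("total", "1,250.00", 0.91),
    coerce("due date", "2024-07-31", 0.74),
]
for f in fields:
    print(f.name, f.value, f.confidence)
```

Keeping the value typed means a later step can compare totals or dates directly instead of re-parsing strings.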

That confidence score is the next post’s entire subject — it’s how the validator decides whether a document goes straight through or needs a human to peek at it for two seconds.

In plain words

One AI finds the words. Another AI decides what they mean. Together they read a document better, faster, and cheaper than either could alone. And when one of them is wrong, the split and the confidence scores make it possible to see where.
