How the AI reads a document
Reading a document is two jobs in one. First, find the words and the layout. Second, understand what they mean for this kind of document. The pipeline gives each job to its own AI: a specialist for the first, a generalist for the second.
Two AIs, one job each
You could in theory hand the whole document to a single big AI and ask “please extract the fields.” It works — sometimes. It also costs more, makes more mistakes on tables and signatures, and gets worse the longer the document gets.
The pipeline splits the work in two. A small specialist does the boring half (where are the words?). A small generalist does the interesting half (what do they mean?). Each tool gets used for what it’s good at.
The specialist: layout
The first AI is built specifically for documents. It finds words on the page and tells you exactly where they are. It recognises tables and gives you each cell. It detects signatures, checkmarks, and form fields. It does this consistently — the same scan twice gives you the same result.
What it does not do: decide which words are the “invoice number” and which are the “customer reference.” That’s not its job. It hands the next stage a clean map of the page.
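To make the "clean map of the page" concrete, here is a minimal sketch of what the specialist's output might look like. The class names (Word, TableCell, PageLayout), the bounding-box convention, and the sample values are all assumptions for illustration, not the pipeline's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Word:
    text: str
    # Assumed convention: (left, top, right, bottom) in page units.
    box: tuple[float, float, float, float]


@dataclass(frozen=True)
class TableCell:
    row: int
    col: int
    text: str


@dataclass
class PageLayout:
    page_number: int
    words: list[Word] = field(default_factory=list)
    cells: list[TableCell] = field(default_factory=list)
    has_signature: bool = False

    def text_in_reading_order(self) -> str:
        """Join words top-to-bottom, then left-to-right."""
        ordered = sorted(self.words, key=lambda w: (w.box[1], w.box[0]))
        return " ".join(w.text for w in ordered)


# Hypothetical sample page: words arrive unordered, the map sorts them.
layout = PageLayout(
    page_number=1,
    words=[
        Word("INV-001", (120.0, 40.0, 180.0, 52.0)),
        Word("Invoice", (40.0, 40.0, 110.0, 52.0)),
        Word("Total", (40.0, 300.0, 80.0, 312.0)),
    ],
    cells=[TableCell(0, 0, "Widget"), TableCell(0, 1, "19.99")],
)
print(layout.text_in_reading_order())  # Invoice INV-001 Total
```

Note what is missing: nothing in this structure says which word is the invoice number. It only records what is on the page and where.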
The generalist: meaning
The second AI is the kind that can be told what to look for in plain language. The pipeline tells it: “This is an invoice. Look for these fields: vendor name, invoice number, total amount, due date, line items.” The AI reads the layout from stage one and fills out the form.
Crucially, the generalist isn’t scanning the original document — it’s working from the specialist’s clean map. Faster, cheaper, more accurate.
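As a rough sketch of that hand-off, the code below builds a plain-language instruction from a field list and the text produced by the layout stage, then reads back a reply. The function names, the prompt wording, and the JSON reply format are assumptions made for this example. No real model is called; the reply is a hard-coded stand-in.

```python
import json

# Field list taken from the example instruction in the text.
INVOICE_FIELDS = ["vendor name", "invoice number", "total amount", "due date", "line items"]


def build_prompt(doc_type: str, fields: list[str], layout_text: str) -> str:
    """Combine the plain-language instruction with the layout stage's text."""
    return "\n".join([
        f"Document type: {doc_type}.",
        "Look for these fields: " + ", ".join(fields) + ".",
        "Reply with one JSON object keyed by field name.",
        "Page text from the layout stage:",
        layout_text,
    ])


def parse_reply(reply: str, fields: list[str]) -> dict:
    """Keep only the requested fields; anything missing becomes None."""
    data = json.loads(reply)
    return {name: data.get(name) for name in fields}


prompt = build_prompt("invoice", INVOICE_FIELDS, "Invoice INV-001 Total 19.99")

# Stand-in for a model reply (hypothetical); a real pipeline would send `prompt` to the generalist.
fake_reply = '{"invoice number": "INV-001", "total amount": "19.99", "notes": "ignored"}'
result = parse_reply(fake_reply, INVOICE_FIELDS)
print(result["invoice number"], result["due date"])  # INV-001 None
```

The prompt carries only the small text map, not the scan itself, which is the point of the split.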
Why two, not one
Three reasons:
- Accuracy. Specialist tools beat generalist ones at the parts they specialise in. Asking a generalist to find table cells inside a scanned PDF is a recipe for hallucinated numbers.
- Cost. The specialist is cheap per page and predictable. The generalist is also cheap, but only when it’s working on a small clean map — not a giant blob of raw scan.
- Trust. When something goes wrong, you can tell which AI got it wrong. The split makes the system debuggable.
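The debugging point can be sketched in a few lines. Assuming a human has supplied the correct value for a field, the check below suggests which stage to inspect first. The function name blame_stage and the simple substring test are illustrative assumptions; a real check would likely be fuzzier.

```python
def blame_stage(expected: str, extracted: str, layout_text: str) -> str:
    """Say which stage to inspect first for one field."""
    if extracted == expected:
        return "none"
    if expected not in layout_text:
        # The right value never reached stage two, so the map is at fault.
        return "specialist"
    # The right value was on the map, but a different one was chosen.
    return "generalist"


layout_text = "Invoice INV-001 Customer ref CR-77 Total 19.99"

# The generalist picked the customer reference instead of the invoice number.
print(blame_stage("INV-001", "CR-77", layout_text))  # generalist

# The specialist misread a zero as the letter O, so the map itself is wrong.
print(blame_stage("INV-001", "INV-0O1", "Invoice INV-0O1 Total 19.99"))  # specialist
```

With a single big AI there is no intermediate map to compare against, so this kind of check has nothing to work with.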
What comes out the other end
For each field the generalist extracts, you get three things:
- The field name (vendor, total, due date).
- The value, in the right shape (a number for “total”, a date for “due date”).
- A confidence score — how sure the AI is.
That confidence score is the next post’s entire subject — it’s how the validator decides whether a document goes straight through or needs a human to peek at it for two seconds.
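The three-part result described above could be represented as a small record. This is a sketch under assumptions: the ExtractedField class, the coerce helper, the ISO date format, and a confidence range of 0.0 to 1.0 are all hypothetical choices, not the pipeline's documented output.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Union


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: Union[Decimal, date, str]
    confidence: float  # Assumed range: 0.0 (no idea) to 1.0 (certain).


def coerce(name: str, raw: str) -> Union[Decimal, date, str]:
    """Put the raw text into the right shape for its field."""
    if name == "total":
        # Decimal avoids float rounding on money; strip thousands separators.
        return Decimal(raw.replace(",", ""))
    if name == "due date":
        # Assumes the date arrives as YYYY-MM-DD.
        return date.fromisoformat(raw)
    return raw


# Hypothetical values and scores, for illustration only.
total = ExtractedField("total", coerce("total", "1,250.00"), 0.97)
due = ExtractedField("due date", coerce("due date", "2024-03-31"), 0.71)
vendor = ExtractedField("vendor", coerce("vendor", "Acme Ltd"), 0.99)

print(total.value, due.value, vendor.value)  # 1250.00 2024-03-31 Acme Ltd
```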
In plain words
One AI finds the words. Another AI decides what they mean. Together they read a document more accurately and more cheaply than one big AI doing both jobs, and when something goes wrong, you can tell which of the two got it wrong.