How the assistant reads an email

Key takeaways

Four stages run in plain code before any model token is spent: HTML-to-text, strip the quoted thread, strip the signature, strip footers and trackers.
Quoted-thread cuts trigger on four predictable patterns: Gmail’s On <date>, <name> wrote:, Outlook’s From: … Sent: …, lines starting with > at column zero, and legacy --- Original Message ---.
Signatures cut on the “-- ” delimiter line first (the Usenet convention described in RFC 3676), then on heuristic closings (“Best,”, “Thanks,”). The sender’s name and contact details survive on the message envelope, just not in the body the brain reads.
A typical reply is 90% quoted history and 10% new content; stripping the thread shrinks model input by a factor of five to ten, and the bill drops with it.
Cleanup is conservative: the original raw MIME stays in S3 alongside the cleaned body, so any wrong reply can be traced back to what the sender actually wrote.

Fig 3. Four small steps. The brain only ever sees the last box.

Why bother with parsing?

Three reasons. The first is cost. The brain is billed per token it reads. A typical reply email is 90% quoted history and 10% new content. If you send the whole thing to the model, you pay to re-read what you already saw. Stripping the quoted thread alone usually shrinks model input by a factor of five to ten.
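The arithmetic behind that range is easy to check. The sketch below assumes input size scales with the share of the message that survives the cut; the 80% figure is an illustrative lower bound, not a number from the pipeline, and reduction_factor is a hypothetical helper name.

```python
def reduction_factor(quoted_fraction: float) -> float:
    """How many times smaller the input gets once the quoted share is removed."""
    if not 0.0 <= quoted_fraction < 1.0:
        raise ValueError("quoted_fraction must be in [0, 1)")
    # If 90% is quoted, only 10% remains, so the input is 1 / 0.10 = 10x smaller.
    return 1.0 / (1.0 - quoted_fraction)


for share in (0.80, 0.90):
    factor = reduction_factor(share)
    print(f"{share:.0%} quoted: input shrinks about {factor:.0f}x")
```

A reply that is 80% quoted shrinks about five times; one that is 90% quoted shrinks about ten times.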

The second is accuracy. Models get confused when half the input is the previous reply quoted back at them. They start summarising the thread instead of answering the new question. Feed them just the new content and they answer the new question.

The third is privacy. The thread might carry personal details the original sender never expected an AI to read — old phone numbers, billing info, screenshots. Stripping it before any model sees it is the cleanest way to respect that.

Stage 1 — HTML to plain text

Most email arrives as HTML even when it looks like plain text. Inline styles, table layouts, image links, conditional Outlook markup. The first stage drops every tag and keeps the text content, normalising whitespace and resolving common entities.

The output looks like what you’d see if you copied the email body and pasted it into a notes app. No formatting, no images, just words.
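A minimal sketch of this stage, assuming Python and only the standard library's html.parser. The tag lists and the html_to_text name are illustrative choices, not the reader's actual code.

```python
import re
from html.parser import HTMLParser

# Tags that should produce a line break, and tags whose content is never text.
BLOCK_TAGS = {"p", "div", "tr", "li", "h1", "h2", "h3", "blockquote"}
SKIP_TAGS = {"style", "script", "title"}


class _TextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True resolves entities such as &amp; and &nbsp;.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS or tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).replace("\xa0", " ")
    # Collapse runs of spaces, then allow at most one blank line in a row.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    kept = []
    for line in lines:
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()
```

HTML comments, which is where conditional Outlook markup usually lives, are ignored by HTMLParser unless a handler is defined, so they drop out without extra code.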

Stage 2 — Strip the quoted thread

Email clients all use slightly different quoting conventions, but the patterns are predictable. The reader cuts the message at the first occurrence of any of these:

On <date>, <name> wrote: — Gmail, most modern clients.
From: <name> Sent: <date> — Outlook’s header-style quoting.
A line starting with > at column zero — older clients, plain-text mailing lists.
--- Original Message --- — legacy email forwarding.

Whatever sits above the cut is the new content. Whatever sits below it is history that’s already in the audit log if anyone needs it.
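The cut can be sketched as a scan for the first line that matches any of the four markers. The regular expressions below are assumptions about what each pattern looks like in practice, and strip_quoted_thread is a hypothetical name; a real reader would need more variants (wrapped "On ... wrote:" lines, localised headers).

```python
import re

# Gmail-style attribution line, e.g. "On Mon, 3 Mar 2025, Dana wrote:"
GMAIL_ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")
# Legacy forwarding banner, with any number of dashes on either side.
ORIGINAL_MESSAGE = re.compile(r"^-{2,}\s*Original Message\s*-{2,}\s*$", re.IGNORECASE)


def strip_quoted_thread(text: str) -> str:
    """Return only the text above the first quoted-thread marker."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        # Outlook header block: "From:" with "Sent:" on the same or the next line.
        outlook_header = line.startswith("From:") and (
            "Sent:" in line or next_line.startswith("Sent:")
        )
        if (
            GMAIL_ATTRIBUTION.match(line)
            or ORIGINAL_MESSAGE.match(line)
            or line.startswith(">")
            or outlook_header
        ):
            return "\n".join(lines[:i]).rstrip()
    return text.rstrip()
```

Cutting at the first match keeps the rule simple, at the cost of dropping inline replies written between quoted lines.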

Stage 3 — Strip the signature

Signatures are noisier than they look. “Sent from my iPhone” counts. So does the four-line title-and-phone-number block, the marketing tagline, the social media icons, the “think before you print” environmental footer.

The convention is a line containing exactly -- (two dashes and a space) above the signature. When that signal isn’t there, the reader uses heuristics: lines that look like contact details, lines that match common closings (“Best,” “Thanks,” “Regards,” the sender’s name on its own line), short last paragraphs that look like a sign-off.

A note: the sender’s name and contact details are not thrown away. The reader keeps them in a separate field on the message envelope and removes them only from the body the brain reads. The brain doesn’t need the signature to know who sent the email; the envelope already says so.
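A rough sketch of the two-step rule, delimiter first and heuristics second. The closing list, the eight-line tail window, and the strip_signature name are assumptions for illustration. The sketch also accepts a bare "--" because an earlier whitespace pass may have trimmed the trailing space.

```python
import re

CLOSINGS = re.compile(
    r"^(best|thanks|thank you|regards|kind regards|cheers)[,!.]?\s*$", re.IGNORECASE
)
MOBILE_TAGLINE = re.compile(r"^sent from my \w+", re.IGNORECASE)
TAIL_LINES = 8  # only the end of the message is searched for a sign-off


def strip_signature(text: str, sender_name: str = "") -> str:
    lines = text.splitlines()

    # 1. The conventional delimiter: a line holding only "-- ".
    for i, line in enumerate(lines):
        if line.rstrip() == "--":
            return "\n".join(lines[:i]).rstrip()

    # 2. Heuristics, applied to the last few lines only.
    name = sender_name.strip().lower()
    for i in range(max(0, len(lines) - TAIL_LINES), len(lines)):
        candidate = lines[i].strip()
        looks_like_signature = (
            CLOSINGS.match(candidate)
            or MOBILE_TAGLINE.match(candidate)
            or (name and candidate.lower() == name)
        )
        kept = "\n".join(lines[:i]).rstrip()
        # Never cut the whole message: a one-line "Thanks!" is content.
        if looks_like_signature and kept:
            return kept
    return text.rstrip()
```

The sender_name argument would come from the envelope, which is one reason the name is kept there rather than discarded.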

Stage 4 — Strip footers and trackers

Bulk-mail and corporate footers (“CONFIDENTIALITY NOTICE”, unsubscribe links, image trackers, marketing disclaimers) get cut last. These are usually long, never relevant to the message, and almost always anchored on phrases like:

“This email and any attachments are confidential…”
“Please consider the environment…”
“You received this email because…”

Anchor matched, drop everything below.
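As a sketch, the anchor rule is a case-insensitive substring check per line. The anchor list below reuses the phrases quoted above; the function name strip_footer is hypothetical, and a production list would be longer.

```python
FOOTER_ANCHORS = (
    "confidentiality notice",
    "this email and any attachments are confidential",
    "please consider the environment",
    "you received this email because",
)


def strip_footer(text: str) -> str:
    """Drop the first line containing a footer anchor and everything below it."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        lowered = line.lower()
        if any(anchor in lowered for anchor in FOOTER_ANCHORS):
            return "\n".join(lines[:i]).rstrip()
    return text.rstrip()
```

Matching whole lines rather than character offsets means the cut always lands on a line boundary, so a half-sentence of disclaimer never leaks through.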

What lands at the brain

By the end of stage 4, what’s left is what the sender meant to write. A two-line question. A three-line request. A short paragraph asking about pricing. The brain reads this, plus the sender’s name and email address from the envelope, and that’s the input it makes a decision on.

A small concession: the reader stores the original raw email in S3, alongside the cleaned version. If a reply later turns out to be wrong because something important was in the signature or the quoted history, the audit trail has both. Cleanup is conservative, not destructive.

Attachments and weird shapes

The reader handles three attachment shapes pragmatically:

Images — the file is stored in S3, the body keeps a placeholder “[image: filename]” so the brain knows there was one.
PDFs and documents — same treatment; if your business actually needs the contents, that’s a separate workflow (and the document pipeline series covers exactly this).
Calendar invites and vCards — flagged in the envelope and routed to escalation by default. The brain doesn’t accept invites on your behalf.
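The three cases above can be sketched as a dispatch on MIME content type. Everything named here is hypothetical: the Envelope fields, the handle_attachment function, and the "[document: filename]" placeholder, which is assumed by analogy with the image one. The upload to S3 is left out.

```python
from dataclasses import dataclass, field

# Content types treated as invites or contact cards (an assumed list).
ESCALATE_TYPES = {"text/calendar", "text/vcard", "text/x-vcard"}


@dataclass
class Envelope:
    flags: list = field(default_factory=list)
    escalate: bool = False


def handle_attachment(filename: str, content_type: str, body: str, envelope: Envelope) -> str:
    """Return the body with a placeholder added, or flag the envelope instead."""
    content_type = content_type.lower()
    if content_type in ESCALATE_TYPES:
        # Invites and vCards are never acted on; a human decides.
        envelope.flags.append(f"attachment:{content_type}:{filename}")
        envelope.escalate = True
        return body
    kind = "image" if content_type.startswith("image/") else "document"
    return f"{body}\n[{kind}: {filename}]"
```

For example, an image named logo.png leaves "[image: logo.png]" at the end of the body, while an invite.ics leaves the body unchanged and sets escalate on the envelope.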

In plain words

Real email is messy. Most of what arrives is noise that humans skip without thinking, but a language model would patiently read every word of. The reader does the skipping first, in cheap plain code, and hands the brain only what the sender meant to say. Smaller input, sharper output, lower bill.
