Part 3 of 7 · Email assistant series ~5 min read

How the assistant reads an email

A real email is mostly not the message. It’s a quoted thread, a signature, a legal footer, a tracking pixel, and one short paragraph of actual question. The reader’s only job is to find that paragraph.

[Figure: Reading an email: peel the layers. A horizontal flow from “Raw email” (HTML body, quoted thread, signature, legal footer and trackers; about 4 KB typical) through four stages to “Clean message” (just what the sender actually wrote; about 200 bytes typical). Stage 1, HTML → plain text: drop tags, keep words. Stage 2, strip quoted thread: cut at “On Tue…wrote:”. Stage 3, strip signature: below the “-- ” line. Stage 4, strip footer + trackers: disclaimers, pixel images. Note: a four-paragraph reply email becomes a two-line question. The brain costs less and focuses better.]
Fig 3. Four small steps. The brain only ever sees the last box.

Why bother with parsing?

Three reasons. The first is cost. The brain charges per word it reads. A typical reply email is 90% quoted history and 10% new content. If you send the whole thing to the model, you pay to re-read what you already saw. Stripping the quoted thread alone usually shrinks the model input to somewhere between a fifth and a tenth of its original size.

The second is accuracy. Models get confused when half the input is the previous reply quoted back at them. They start summarising the thread instead of answering the new question. Feed them just the new content and they answer the new question.

The third is privacy. The thread might carry personal details the original sender never expected an AI to read — old phone numbers, billing info, screenshots. Stripping it before any model sees it is the cleanest way to respect that.

Stage 1 — HTML to plain text

Most email arrives as HTML even when it looks like plain text. Inline styles, table layouts, image links, conditional Outlook markup. The first stage drops every tag and keeps the text content, normalising whitespace and resolving common entities.

The output looks like what you’d see if you copied the email body and pasted it into a notes app. No formatting, no images, just words.
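As a rough illustration of this stage, here is a minimal sketch using only Python’s standard-library html.parser. The names html_to_text and _TextExtractor, and the choice of which tags count as block breaks, are assumptions made for the example, not the reader’s actual code.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; skips script/style/title and marks block breaks."""

    SKIP = {"script", "style", "title"}
    BLOCK = {"p", "div", "br", "tr", "li", "blockquote",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True resolves entities such as &amp; and &nbsp;
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Drop every tag, keep the words, normalise whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).replace("\xa0", " ")
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    kept = []
    for line in lines:
        # keep text lines, and at most one blank line between them
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>Hi&nbsp;there,</p>"
    "<div>What is the <b>price</b> &amp; lead time?</div>"
    '<img src="https://t.example/pixel.gif">'
    "</body></html>"
)
print(html_to_text(sample))
```

Images carry no text nodes, so they vanish along with every other tag. HTML comments, which is where conditional Outlook markup usually lives, are ignored by the parser by default.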

Stage 2 — Strip the quoted thread

Email clients all use slightly different quoting conventions, but the patterns are predictable. The reader cuts the message at the first occurrence of any of these:

  • On <date>, <name> wrote: — Gmail, most modern clients.
  • From: <name> Sent: <date> — Outlook’s header-style quoting.
  • Lines that begin with > at column zero — older clients, plain-text mailing lists.
  • --- Original Message --- — legacy email forwarding.

Whatever sits above the cut is the new content. Whatever sits below it is history that’s already in the audit log if anyone needs it.
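The cut can be sketched with a handful of regular expressions, one per convention listed above. The patterns and the name strip_quoted_thread are illustrative assumptions; a real client zoo has more variants (Gmail sometimes wraps the “On … wrote:” line across two lines, for instance), so treat this as a starting point rather than a complete matcher.

```python
import re

# One pattern per quoting convention; all are anchored at the start of a line.
QUOTE_MARKERS = [
    # On <date>, <name> wrote:
    re.compile(r"^On\s.+\swrote:\s*$", re.MULTILINE),
    # From: <name> Sent: <date>  (same line or consecutive lines)
    re.compile(r"^From:\s.+(?:\n|\s+)Sent:\s", re.MULTILINE),
    # Lines that start with > at column zero
    re.compile(r"^>", re.MULTILINE),
    # --- Original Message ---
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}", re.MULTILINE | re.IGNORECASE),
]


def strip_quoted_thread(text: str) -> str:
    """Cut the message at the earliest quoting marker, if there is one."""
    cut = len(text)
    for pattern in QUOTE_MARKERS:
        match = pattern.search(text)
        if match:
            cut = min(cut, match.start())
    return text[:cut].rstrip()


reply = (
    "Can you send the updated quote?\n"
    "\n"
    "On Tue, Mar 4, 2025 at 9:12 AM, Dana Lee <dana@example.com> wrote:\n"
    "> Here is the first draft.\n"
    "> Let me know.\n"
)
print(strip_quoted_thread(reply))
```

Taking the earliest match across all patterns, rather than the first pattern that matches, keeps the result independent of the order of the list.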

Stage 3 — Strip the signature

Signatures are noisier than they look. “Sent from my iPhone” counts. So does the four-line title-and-phone-number block, the marketing tagline, the social media icons, the “think before you print” environmental footer.

The convention is a line containing exactly -- (two dashes and a space) above the signature. When that signal isn’t there, the reader uses heuristics: lines that look like contact details, lines that match common closings (“Best,” “Thanks,” “Regards,” the sender’s name on its own line), short last paragraphs that look like a sign-off.

A note: the sender’s name and contact details are not lost. The reader keeps them in a separate field on the message envelope and removes them only from the body the brain reads. The brain doesn’t need the signature to know who sent the email; the envelope already says so.
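Here is one way the delimiter-then-heuristics order might look in code. The closing phrases, the six-line tail window, and the name strip_signature are assumptions for the sketch; the contact-details heuristic is left out because it needs more care (a phone number can be the whole point of a message).

```python
import re

SIG_DELIMITER = "-- "  # two dashes and a trailing space
CLOSING = re.compile(
    r"^(best|best regards|kind regards|regards|thanks|thank you|cheers)[,.!]?$",
    re.IGNORECASE,
)
MOBILE = re.compile(r"^sent from my \w+", re.IGNORECASE)
TAIL_LINES = 6  # assumed: only the last few lines can start a signature


def strip_signature(text: str, sender_name: str = "") -> str:
    """Remove the signature: explicit delimiter first, then heuristics."""
    lines = text.splitlines()

    # 1. The conventional delimiter line wins outright.
    for i, line in enumerate(lines):
        if line == SIG_DELIMITER:
            return "\n".join(lines[:i]).rstrip()

    # 2. Otherwise look for a sign-off near the end of the message.
    while lines and not lines[-1].strip():
        lines.pop()
    start = max(0, len(lines) - TAIL_LINES)
    for i in range(start, len(lines)):
        candidate = lines[i].strip()
        is_name = bool(sender_name) and candidate.lower() == sender_name.lower()
        if CLOSING.match(candidate) or MOBILE.match(candidate) or is_name:
            return "\n".join(lines[:i]).rstrip()
    return "\n".join(lines)


print(strip_signature("Do you ship to Canada?\n\n-- \nDana Lee\nAcme Corp"))
```

One caveat of heuristics like these: a message that consists only of “Thanks!” comes out empty, so a caller would probably want to fall back to the unstripped text in that case.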

Stage 4 — Strip footers and trackers

Bulk-mail and corporate footers (“CONFIDENTIALITY NOTICE”, unsubscribe links, image trackers, marketing disclaimers) get cut last. These are usually long, never relevant to the message, and almost always anchored on phrases like:

  • “This email and any attachments are confidential…”
  • “Please consider the environment…”
  • “You received this email because…”

Anchor matched, drop everything below.
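A minimal sketch of the anchor rule, using the phrases quoted above. The function name strip_footer and the decision to cut from the start of the line that contains the anchor are assumptions for the example. Tracker handling is not shown here.

```python
import re

# Anchor phrases taken from the examples above; matching is case-insensitive.
FOOTER_ANCHORS = [
    r"confidentiality notice",
    r"this email and any attachments are confidential",
    r"please consider the environment",
    r"you received this email because",
]
FOOTER_RE = re.compile("|".join(FOOTER_ANCHORS), re.IGNORECASE)


def strip_footer(text: str) -> str:
    """Drop the line containing the first footer anchor and everything below it."""
    match = FOOTER_RE.search(text)
    if not match:
        return text.rstrip()
    # rfind returns -1 when the anchor is on the first line, so +1 gives 0.
    line_start = text.rfind("\n", 0, match.start()) + 1
    return text[:line_start].rstrip()


message = (
    "What is your pricing for 50 seats?\n"
    "\n"
    "CONFIDENTIALITY NOTICE: This email and any attachments are confidential "
    "and intended solely for the addressee.\n"
    "You received this email because you are a customer.\n"
)
print(strip_footer(message))
```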

What lands at the brain

By the end of stage 4, what’s left is what the sender meant to write. A two-line question. A three-line request. A short paragraph asking about pricing. The brain reads this, plus the sender’s name and email address from the envelope, and that’s the input it makes a decision on.

A small concession: the reader stores the original raw email in S3, alongside the cleaned version. If a reply later turns out to be wrong because something important was in the signature or the quoted history, the audit trail has both. Cleanup is conservative, not destructive.
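To make “conservative, not destructive” concrete, here is a small sketch of a record that carries both versions side by side. ParsedEmail, parse_email, and the idea of passing the stages in as plain functions are all hypothetical. The S3 upload itself is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Stage = Callable[[str], str]  # each stage maps text to cleaner text


@dataclass(frozen=True)
class ParsedEmail:
    raw: str           # original message, untouched, for the audit trail
    clean: str         # what the brain actually reads
    sender_name: str   # from the envelope, not from the body
    sender_email: str


def parse_email(raw: str, sender_name: str, sender_email: str,
                stages: Iterable[Stage]) -> ParsedEmail:
    """Run the cleanup stages in order while keeping the raw text alongside."""
    text = raw
    for stage in stages:
        text = stage(text)
    return ParsedEmail(raw=raw, clean=text,
                       sender_name=sender_name, sender_email=sender_email)


# Stand-in stages, just to show the shape; real stages would do the stripping.
demo_stages = [lambda t: t.split("\n-- \n")[0], str.strip]
raw_message = "  Do you offer annual billing?\n-- \nDana  "
record = parse_email(raw_message, "Dana", "dana@example.com", demo_stages)
print(record.clean)
```

Because the record is frozen and raw is never reassigned, nothing downstream can quietly overwrite the original.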

Attachments and weird shapes

The reader handles three attachment shapes pragmatically:

  • Images — the file is stored in S3, the body keeps a placeholder “[image: filename]” so the brain knows there was one.
  • PDFs and documents — same treatment; if your business actually needs the contents, that’s a separate workflow (and the document pipeline series covers exactly this).
  • Calendar invites and vCards — flagged in the envelope and routed to escalate by default. The brain doesn’t accept invites on your behalf.
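The three cases above could be routed along these lines. Everything named here is an assumption for illustration: the Attachment and Envelope classes, the route names, the MIME types used to spot invites and vCards, and the “[document: filename]” placeholder (the text above only specifies the image form). Storage is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attachment:
    filename: str
    content_type: str


@dataclass
class Envelope:
    sender: str
    flags: List[str] = field(default_factory=list)
    route: str = "brain"  # hypothetical default route


# Assumed MIME types for calendar invites and vCards.
ESCALATE_TYPES = {"text/calendar", "text/vcard", "text/x-vcard"}


def handle_attachments(body: str, attachments: List[Attachment],
                       envelope: Envelope) -> str:
    """Add placeholders to the body; flag and escalate invites and vCards."""
    placeholders = []
    for att in attachments:
        ctype = att.content_type.lower()
        if ctype in ESCALATE_TYPES:
            envelope.flags.append(f"attachment:{ctype}")
            envelope.route = "escalate"
        elif ctype.startswith("image/"):
            placeholders.append(f"[image: {att.filename}]")
        else:
            placeholders.append(f"[document: {att.filename}]")
    if placeholders:
        body = body.rstrip() + "\n\n" + "\n".join(placeholders)
    return body


env = Envelope(sender="dana@example.com")
result = handle_attachments(
    "See attached.",
    [Attachment("chart.png", "image/png"), Attachment("invite.ics", "text/calendar")],
    env,
)
print(result)
print(env.route)
```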

In plain words

Real email is messy. Most of what arrives is noise that humans skip without thinking, but a language model would patiently read every word of. The reader does the skipping first, in cheap plain code, and hands the brain only what the sender meant to say. Smaller input, sharper output, lower bill.
