Part 2 of 7 · Document pipeline series ~4 min read

How a document enters the pipeline

Documents arrive three ways: uploaded, emailed, or dropped into a shared folder. The intake step gets them into one shared place, screens them for trouble, and records where each one came from, along with a best-guess type, for the rest of the pipeline.

[Figure: Three ways in (an upload form, a small page on your site; an email forward to docs@yourdomain; a folder drop from a scanner or shared drive folder) all feed a single Intake Lambda, which writes each document to a shared S3 bucket as a fresh raw file and attaches metadata. A virus scan runs the moment a new file lands. On "pass", the document is forwarded to the Reader, the next stage in the pipeline. On "fail", the file is held in Quarantine for review and the operator is notified. Every arriving document gets the same treatment, regardless of where it came from.]
Fig 2. Three ways in, one shared mailbox, one safety check before anything else runs.

Three ways in

Different teams have different habits. The pipeline meets people where they are:

  • An upload form — a small page on your site where a customer or staff member drags a file in. The simplest path. Good for one-off scans and customer submissions.
  • An email forward — a real email address (something like docs@yourdomain). When you get an invoice in your inbox, you forward it. The pipeline picks it up automatically.
  • A folder drop — the office scanner or a shared drive folder. Drop a file in; the pipeline takes it from there.

You don’t have to use all three. Most clients pick one and stick with it.

What gets attached at intake

The intake step doesn’t read the document yet. It just saves the file and writes a small label alongside it:

  • Where the document came from (upload form, email, folder).
  • Who sent it, if known (email sender, signed-in user).
  • What time it arrived.
  • A best-guess document type, if the source gives a hint (e.g. emailed invoices land in the “invoice” bucket by default).

That label travels with the document through every step that follows. If something goes wrong later, you can always see how it got there.
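
As a rough illustration, the label could be a small record saved next to the file. This is a sketch under assumptions, not the pipeline's actual schema: the field names, the `build_label` helper, and the rule that only emailed documents get a default type are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical default: in this sketch only the email path carries a type hint.
DEFAULT_TYPE_BY_SOURCE = {"email": "invoice"}

ALLOWED_SOURCES = ("upload", "email", "folder")


@dataclass(frozen=True)
class IntakeLabel:
    source: str                    # "upload", "email", or "folder"
    sender: Optional[str]          # email sender or signed-in user, if known
    received_at: str               # arrival time, ISO 8601 in UTC
    doc_type_guess: Optional[str]  # None when the source gives no hint


def build_label(source: str, sender: Optional[str] = None,
                now: Optional[datetime] = None) -> IntakeLabel:
    """Build the label without opening or reading the document itself."""
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    arrived = now or datetime.now(timezone.utc)
    return IntakeLabel(
        source=source,
        sender=sender,
        received_at=arrived.isoformat(),
        doc_type_guess=DEFAULT_TYPE_BY_SOURCE.get(source),
    )


def to_sidecar_json(label: IntakeLabel) -> str:
    """Serialize the label so it can be stored alongside the raw file."""
    return json.dumps(asdict(label), sort_keys=True)
```

Because the label is built from the arrival details alone, it can be written before any later stage touches the file.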

Why virus scan first

Documents arrive from outside your business. Some of them — not many, but some — arrive carrying things you don’t want. Before the AI reader spends a single cent reading a file, the system runs a quick virus scan.

Two outcomes:

  • Pass. Almost everything passes. The document moves on to the next stage.
  • Fail. The file is held in a quarantine spot, an alert goes to the operator, and nothing in the pipeline reads its contents. The original sender is not notified automatically; whether to contact them is the operator’s decision.
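
The two outcomes can be sketched as a single routing decision. This is a minimal sketch with hypothetical names (`ScanResult`, `Routing`, `route_after_scan`); the `alert_operator` callback stands in for whatever notification channel the real system uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ScanResult(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass(frozen=True)
class Routing:
    destination: str        # "reader" or "quarantine"
    operator_alerted: bool


def route_after_scan(file_key: str, result: ScanResult,
                     alert_operator: Callable[[str], None]) -> Routing:
    """Decide where a file goes once the scan verdict is known."""
    if result is ScanResult.PASS:
        # Clean files move on to the Reader, the next stage.
        return Routing(destination="reader", operator_alerted=False)
    # Failed files are held and only the operator is told.
    # Nothing here contacts the original sender.
    alert_operator(f"Quarantined {file_key}: held for review")
    return Routing(destination="quarantine", operator_alerted=True)
```

In this sketch the function takes only the file's key and the verdict, never the file contents, which mirrors the rule that nothing in the pipeline reads a failed file.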

What gets quarantined (and what happens then)

Quarantine is rare, and most quarantined files aren’t actual viruses: they’re corrupted scans, password-locked PDFs the scanner can’t open, or files in formats the system doesn’t support. The operator decides whether to release them, ask the sender to resend, or simply ignore them.

The scanner itself is a small piece you can swap. The simplest version is an open-source scanner the pipeline runs itself. The cloud provider also offers a managed version that does the same job without you running anything. It’s the right call when it’s available in your region and the pay-per-scan price fits your volume.
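
One way to keep the scanner swappable is to have the rest of the pipeline depend only on a tiny interface. The sketch below is illustrative and every name in it is hypothetical. The byte-pattern check is a toy stand-in for a real open-source scanner, and the `submit` callable is a stand-in for a call to a managed service, not a real API.

```python
from typing import Callable, List, Protocol


class Scanner(Protocol):
    """Anything with an is_clean(data) method can serve as the scanner."""

    def is_clean(self, data: bytes) -> bool: ...


class SelfHostedScanner:
    """Toy stand-in for an open-source scanner the pipeline runs itself."""

    def __init__(self, known_bad: List[bytes]) -> None:
        self._known_bad = known_bad

    def is_clean(self, data: bytes) -> bool:
        # A real scanner does far more than substring matching.
        return not any(pattern in data for pattern in self._known_bad)


class ManagedScanner:
    """Toy stand-in for a managed cloud scan; `submit` is an assumed hook."""

    def __init__(self, submit: Callable[[bytes], str]) -> None:
        self._submit = submit

    def is_clean(self, data: bytes) -> bool:
        return self._submit(data) == "clean"


def screen(data: bytes, scanner: Scanner) -> str:
    """Return "pass" or "fail" without caring which scanner is behind it."""
    return "pass" if scanner.is_clean(data) else "fail"
```

Switching from one scanner to the other then means changing which object is passed to `screen`, while the pass and fail handling stays the same.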

In plain words

Three ways in, one mailbox, one safety check. Once a document is past the virus scan, the rest of the pipeline can trust what it’s reading. The label that intake attached travels with the document everywhere — so a year from now you can still trace any field on any sheet back to the exact file it came from.
