Part 2 of 7 · FAQ builder series · ~4 min read

How the question pile gets built

The builder can only find repeats in questions it has seen. So the first job is feeding it the questions your customers actually ask — not the ones you think they ask. There are three ways a question gets into the pile: support email forwarded to a dedicated address, chat transcripts dropped in a Drive folder, or a rep typing in a question by hand. The first two are where most of the volume comes from. The third is for the question a rep knows is common but hasn’t been captured yet.

Key takeaways

  • Three intake lanes feed one pile: a support inbox, a chat export, and a manual lane.
  • Each question is cleaned (signatures, greetings, and order numbers stripped) so it stands on its own.
  • The cleaned question is embedded with Titan Text Embeddings V2 and written to S3 Vectors.
  • Personal details are removed before anything is stored — the pile holds questions, not customer data.
  • The pile is the one place the grouper looks. The lanes are just ways of filling it.

Three lanes into one pile

[Diagram: three intake lanes funnel into one question pile.
  • Lane 1 · SES inbound (Support inbox): email routed to the support address; SES writes MIME to S3; intake reads the question and cleans it; embeds with Titan V2.
  • Lane 2 · Drive sync (Chat export): transcripts dropped in a Drive folder; sync mirrors them to S3; intake splits out each question; cleans and embeds each one.
  • Lane 3 · by hand (Manual): rep types a question in Slack; skips parsing and goes to cleaning; same embed step as the other lanes; for known but uncaptured asks.
  • Question pile (S3 Vectors + DynamoDB index): cleaned text · vector · source tag · first-seen date · times-seen counter. The grouper reads the new vectors on its daily pass.
Personal details are stripped before anything is stored.]
Fig 2. Three lanes converge on one question pile. The pile holds the cleaned question text and its vector; the inbox, chat, and manual lanes are just three ways of filling it. The grouper reads the new vectors once a day.

Lane 1: the support inbox (most of the volume)

Set up a dedicated inbound address — something like questions@your-company.com — via Amazon SES, and forward or auto-route your support email there. (If you already run support out of a shared inbox, a single forwarding rule does it.) SES writes the raw MIME to s3://fb-raw-mime/. The S3 PUT triggers an intake Lambda that walks the MIME to the message body and pulls out the actual question.
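The post doesn't show the intake code, so here is a minimal sketch of the "walk the MIME to the message body" step using Python's standard `email` package. The function names (`extract_body_text`, `object_location`, `handler`) and the choice to return the bodies from the handler are assumptions made for illustration, not the builder's actual code.

```python
from email import message_from_bytes, policy
from urllib.parse import unquote_plus


def extract_body_text(raw_mime: bytes) -> str:
    """Return the plain-text body of a raw MIME message, or '' if there is none."""
    msg = message_from_bytes(raw_mime, policy=policy.default)
    # get_body() walks multipart containers; we only accept a text/plain part.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    return part.get_content().strip()


def object_location(record: dict) -> tuple[str, str]:
    """Pull (bucket, key) out of one S3 event record; keys arrive URL-encoded."""
    s3_info = record["s3"]
    return s3_info["bucket"]["name"], unquote_plus(s3_info["object"]["key"])


def handler(event, context):
    """Hypothetical Lambda entry point for the S3 PUT trigger."""
    # Imported lazily so the helpers above can be exercised without boto3.
    import boto3

    s3 = boto3.client("s3")
    bodies = []
    for record in event["Records"]:
        bucket, key = object_location(record)
        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        bodies.append(extract_body_text(raw))
    return bodies  # the real intake would hand these on to cleaning
```

An HTML-only email returns an empty string here; a fuller intake would probably fall back to stripping tags from the HTML part rather than dropping the message.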

A typical support email is mostly noise: a greeting, the question, a thank-you, a signature, a legal footer. The intake strips all of that down to the question itself and removes anything personal (names, order numbers, account IDs, email addresses), because the pile is about what people ask, not who asked. The cleaned question is then embedded with Titan Text Embeddings V2 and written to the pile. If a single email contains two unrelated questions, the intake splits them so each can be grouped on its own.
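To make the clean-then-embed step concrete, here is a rough sketch. The regular expressions are illustrative guesses at what "greeting", "sign-off", and "order number" look like, not the builder's real rules, and they do not catch personal names, which would need a proper PII detector. The embedding call uses the Bedrock runtime `invoke_model` API with the Titan Text Embeddings V2 model ID; the 1024-dimension, normalized setting is an assumption, and the client is passed in so the function can be exercised with a stub.

```python
import json
import re

# Illustrative patterns only; a production intake would need broader rules.
GREETING = re.compile(r"^\s*(hi|hello|hey|dear)\b[^?]*$", re.IGNORECASE)
SIGN_OFF = re.compile(
    r"^\s*(--|thanks|thank you|best regards|regards|best|cheers|sincerely)\s*[,.!]?\s*$",
    re.IGNORECASE,
)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LONG_ID = re.compile(r"#?\b[A-Z]{0,4}-?\d{5,}\b")  # order/account-style numbers

TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"


def clean_question(body: str) -> str:
    """Drop the greeting and everything from the sign-off down, then mask IDs."""
    kept = []
    for line in body.splitlines():
        if SIGN_OFF.match(line):
            break  # the rest is signature and footer
        if not kept and (not line.strip() or GREETING.match(line)):
            continue  # skip leading blank lines and a greeting-only line
        kept.append(line.strip())
    text = " ".join(part for part in kept if part)
    text = EMAIL.sub("[email]", text)
    text = LONG_ID.sub("[id]", text)
    return re.sub(r"\s+", " ", text).strip()


def embed_question(bedrock_runtime, text: str, dimensions: int = 1024) -> list[float]:
    """Embed cleaned text; `bedrock_runtime` is a boto3 'bedrock-runtime' client."""
    body = json.dumps({"inputText": text, "dimensions": dimensions, "normalize": True})
    response = bedrock_runtime.invoke_model(
        modelId=TITAN_V2_MODEL_ID,
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]
```

Masking with placeholders such as `[id]` rather than deleting keeps the sentence readable, so "My order [id] arrived late" still embeds close to other late-delivery questions.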

Lane 2: chat export

If you run a live chat or a help widget, you already have a record of what people type in. Most chat tools can export transcripts on a schedule or let you save them to a folder. Point the export at a Drive folder the builder watches. A drive-sync Lambda mirrors the folder to S3 every fifteen minutes; new transcripts trigger the same intake Lambda.

Chat is messier than email — a transcript is a back-and-forth, not a single question — so the intake does a little more work here: it picks out the customer’s turns, keeps the ones that are actually questions, and drops the small talk. Each surviving question is cleaned and embedded exactly like the inbox lane. From the pile’s point of view, a question from chat and a question from email look identical; only the source tag differs.
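One simple way to picture the "pick out the customer's turns, keep the questions, drop the small talk" step is the heuristic below. It assumes the transcript has already been parsed into speaker-labelled turns; the `Turn` type, the `"customer"` label, and the question-word list are all hypothetical, and the real intake may well use something smarter than a keyword check.

```python
import re
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "customer" or "agent" (hypothetical labels)
    text: str


QUESTION_WORDS = (
    "how", "what", "when", "where", "why", "who", "which",
    "can", "could", "do", "does", "is", "are", "will", "would", "should",
)


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def looks_like_question(sentence: str) -> bool:
    """True if it ends with '?' or opens with a common question word."""
    words = sentence.lower().split()
    return sentence.endswith("?") or (bool(words) and words[0] in QUESTION_WORDS)


def customer_questions(turns: list[Turn]) -> list[str]:
    """Keep only the customer's sentences that read as questions."""
    questions = []
    for turn in turns:
        if turn.speaker != "customer":
            continue  # agent turns never enter the pile
        questions.extend(s for s in split_sentences(turn.text) if looks_like_question(s))
    return questions
```

Each string this returns would then go through the same cleaning and embedding as an inbox question, tagged with a chat source.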

Lane 3: manual entry

Sometimes a rep just knows. “Everyone asks whether the warranty covers water damage, and it’s not written down anywhere.” They don’t need to wait for five emails to prove it. Lane 3 is a small Slack form (or a row in the Drive sheet) where a rep types the question directly. It skips the parsing step — there’s no email or transcript to read — and goes straight to cleaning and embedding.

A manually entered question can be marked “priority,” which tells the grouper in Part 3 to treat it as a candidate even if it hasn’t been asked five times yet. That’s the escape hatch for the obviously-common question that just hasn’t shown up in the data yet.
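As a rough illustration of how the priority flag could travel with a manual entry, the sketch below stores it on the record and lets a candidate check honor it. The grouper's real rule is covered in Part 3; `PileEntry`, `manual_entry`, `is_candidate`, and the `"manual"` tag value are names invented for this example, and the threshold of five is taken from the "asked five times" wording above.

```python
from dataclasses import dataclass

REPEAT_THRESHOLD = 5  # "asked five times"; the actual rule lives in the grouper


@dataclass
class PileEntry:
    text: str
    source: str            # e.g. "inbox", "chat", "manual" (hypothetical values)
    times_seen: int = 1
    priority: bool = False


def manual_entry(text: str, priority: bool = False) -> PileEntry:
    """Build a Lane 3 entry; there is no email or transcript to parse."""
    return PileEntry(text=text.strip(), source="manual", priority=priority)


def is_candidate(entry: PileEntry) -> bool:
    """A priority entry qualifies immediately; others must repeat enough."""
    return entry.priority or entry.times_seen >= REPEAT_THRESHOLD
```

With this shape, a rep's "Does the warranty cover water damage?" marked priority is a candidate on day one, while an unmarked manual entry waits for the count like anything else.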

Why everything funnels into one pile

Three lanes in, but only one place the grouper looks. That’s deliberate. If chat questions and email questions lived in separate stores, “how often do people ask this?” would mean counting across two places and hoping the deduplication worked. Funneling everything into one pile of cleaned, embedded questions means there is exactly one count per question, regardless of where it came from. The lanes differ only in how questions arrive; every question passes through the same cleaning and embedding on the way in.
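The "one count per question" idea can be shown with a small in-memory stand-in for the pile. This is a toy, not the S3 Vectors and DynamoDB store: it matches questions by normalized text, whereas the real pile relies on vectors and leaves similarity grouping to Part 3. The class and field names are assumptions, and tracking a set of sources (rather than a single source tag) is a simplification for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class PileRecord:
    text: str
    first_seen: date
    times_seen: int = 0
    sources: set[str] = field(default_factory=set)


class QuestionPile:
    """Toy single store: every lane calls add(), so each question has one count."""

    def __init__(self) -> None:
        self._records: dict[str, PileRecord] = {}

    def __len__(self) -> int:
        return len(self._records)

    def add(self, cleaned_text: str, source: str, seen_on: date) -> PileRecord:
        # Exact match on lowercased, whitespace-collapsed text (a stand-in
        # for the vector-based matching the real pile would use).
        key = " ".join(cleaned_text.lower().split())
        record = self._records.get(key)
        if record is None:
            record = PileRecord(text=cleaned_text, first_seen=seen_on)
            self._records[key] = record
        record.times_seen += 1
        record.sources.add(source)
        record.first_seen = min(record.first_seen, seen_on)
        return record
```

Because inbox, chat, and manual entries all land in the same dictionary, asking "how often is this asked?" is a single lookup instead of a merge across stores.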

Next post: how the grouper reads the pile, finds the questions that keep repeating, and decides which clusters earn a FAQ entry.
