Part 3 of 7 · FAQ builder series ~5 min read

How repeat questions get grouped

Once a day, an EventBridge Scheduler schedule fires the grouper Lambda. It reads the questions added since the last pass and takes them one at a time: for each, it finds the questions already in the pile that mean the same thing, then either adds the question to an existing group or starts a new one. Then it decides which groups have been asked enough to deserve a FAQ entry. The whole pass is plain Python over the vectors. No model writes anything here; the only AI involved is the embedding from Part 2.

Key takeaways

  • The grouper runs once a day via EventBridge Scheduler.
  • Each new question is matched against existing clusters by vector nearness — same meaning, not same words.
  • A near-match joins a cluster; nothing close enough starts a new one.
  • A cluster that crosses the repeat threshold (default five asks, not already covered) becomes a candidate.
  • The grouper writes no answer — it only decides what is worth answering.

The grouping flow, per question

[Figure: vertical decision flow for each new question on every daily pass.]

Input: a new question from the pile, with its cleaned text, vector, source tag, and seen date.

  • Step 1, search nearest vectors. Query S3 Vectors for the closest existing questions.
  • Step 2, already covered by a live entry? Check the fb-clusters table. If yes, route to Skip (count it, but make no new candidate). If no, continue.
  • Step 3, join or start a cluster. If a near cluster exists within the join threshold, attach this question and bump its times-seen counter; if not, start a new single-question cluster.
  • Step 4, new cluster, or covered? A brand-new single-question cluster with nothing near it routes to Skip (counted, but not yet worth a candidate); a cluster whose question is already covered by a live entry but keeps getting asked routes to Refresh (the entry may need updating). Otherwise continue.
  • Step 5, crossed the repeat threshold? Compare the cluster's times-seen over the window against the configured minimum (default five), with the priority flag from a manual entry as an override. If below the threshold, route to Wait (keep the cluster warm for a future pass); if it crosses, route to Candidate (mark the cluster ready for the drafter).

Each terminal box (Skip, Wait, Candidate, Refresh) writes the cluster state to DynamoDB. The threshold lives in the rules doc: change the minimum-asks number and tomorrow's pass uses it.
Fig 3. The grouper’s decision tree, per question, per daily pass. Five steps decide whether a question is skipped, kept warm, turned into a candidate, or used to flag a live entry for a refresh. The rules doc holds the threshold; the grouper only enforces it.

Same meaning, not the same words

The whole reason questions are embedded in Part 2 is so the grouper can match them by meaning. “Do you ship to Canada?”, “Can you deliver to Toronto?”, and “Is international shipping available north of the border?” share almost no words, but they’re the same question. Their vectors sit close together, so a nearest-neighbor search in S3 Vectors finds them as a group. A keyword match would put them in three different buckets and you’d never see that it’s one popular question asked three ways.
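
To see how little a keyword match has to work with, here is a small sketch that measures word overlap between the three example questions. The Jaccard measure and the helper names are illustrative choices, not part of the pipeline.

```python
import re

def words(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of distinct words the two questions have in common (0.0 to 1.0)."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

q1 = "Do you ship to Canada?"
q2 = "Can you deliver to Toronto?"
q3 = "Is international shipping available north of the border?"

print(jaccard(q1, q2))  # 0.25: only "you" and "to" are shared
print(jaccard(q1, q3))  # 0.0: "ship" and "shipping" are different tokens
print(jaccard(q2, q3))  # 0.0
```

The only shared words are filler, which is why the grouper compares vectors instead.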

The grouper queries S3 Vectors for the closest existing questions to each new one. If the closest is within the join threshold (a distance the rules doc sets, with a sensible default), the new question joins that cluster and bumps its times-seen counter. If nothing is close enough, the question starts a new single-question cluster of its own. Over days, popular questions accrete into big clusters and rare one-offs stay as singletons.
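
The join-or-start step can be sketched in plain Python. This is a minimal illustration under several assumptions: an in-memory list stands in for the S3 Vectors query, each cluster is represented by a single `anchor` vector, cosine distance is the metric, and the `join_threshold` default of 0.25 is made up for the example (the real value comes from the rules doc).

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / norm

def join_or_start(vector, clusters, join_threshold=0.25):
    """Attach the question to its nearest cluster, or start a new one.

    clusters is a list of dicts with hypothetical keys 'anchor' and
    'times_seen'. Returns the index of the cluster the question landed in.
    """
    best_index, best_distance = None, float("inf")
    for i, cluster in enumerate(clusters):
        d = cosine_distance(vector, cluster["anchor"])
        if d < best_distance:
            best_index, best_distance = i, d

    if best_index is not None and best_distance <= join_threshold:
        clusters[best_index]["times_seen"] += 1  # near match: join and count
        return best_index

    clusters.append({"anchor": list(vector), "times_seen": 1})  # new singleton
    return len(clusters) - 1

clusters = []
join_or_start([1.0, 0.0], clusters)    # nothing exists yet: starts cluster 0
join_or_start([0.99, 0.1], clusters)   # very close: joins cluster 0
join_or_start([0.0, 1.0], clusters)    # far away: starts cluster 1
```

A linear scan like this only makes sense for a toy pile; the point of a vector store is that the nearest-neighbor lookup stays fast as the pile grows.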

Four outcomes, every pass

Every new question, every pass, lands in exactly one of four buckets. The names are plain on purpose.

  • Skip. The question is already covered by a live FAQ entry. Count it — knowing a covered question is still being asked is useful — but don’t make a new candidate. Most questions, once the FAQ matures, land here.
  • Wait. The question joined a cluster, but the cluster hasn’t been asked enough times yet. Keep it warm. A cluster sitting at three asks this month is one good week away from becoming a candidate; the grouper remembers it so nothing has to start over.
  • Candidate. The cluster just crossed the repeat threshold and isn’t covered. Mark it ready for the drafter in Part 4. This is the whole point of the pass — turning “people keep asking this” into “let’s answer it once, well.”
  • Refresh. The cluster is covered by a live entry, but the asks keep coming — which often means the published answer is unclear, incomplete, or out of date. Flag the entry for a refresh so a reviewer can decide whether the answer needs a rewrite.
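
The four buckets reduce to a small decision function. The sketch below is one possible reading, not the grouper's actual code: the `Cluster` fields are hypothetical, "crosses the threshold" is taken to mean at-or-above, and because the post does not say where Skip ends and Refresh begins, a separate `refresh_asks` parameter is assumed for that line.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    asks_in_window: int     # times asked within the counting window
    covered: bool           # a live FAQ entry already answers it
    priority: bool = False  # manual override set by a rep

def outcome(cluster, min_asks=5, refresh_asks=5):
    """Return 'skip', 'wait', 'candidate', or 'refresh' for one cluster."""
    if cluster.covered:
        # Assumption: a covered cluster that is still asked often is stale.
        if cluster.asks_in_window >= refresh_asks:
            return "refresh"
        return "skip"
    if cluster.priority or cluster.asks_in_window >= min_asks:
        return "candidate"
    return "wait"

print(outcome(Cluster(asks_in_window=3, covered=False)))                 # wait
print(outcome(Cluster(asks_in_window=5, covered=False)))                 # candidate
print(outcome(Cluster(asks_in_window=1, covered=False, priority=True)))  # candidate
print(outcome(Cluster(asks_in_window=2, covered=True)))                  # skip
print(outcome(Cluster(asks_in_window=9, covered=True)))                  # refresh
```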

The threshold is a number you own

How many asks make a FAQ entry? That’s a judgment call, and it lives in the rules doc as a plain number — min_asks_for_candidate, default five. Set it lower if you want a thorough FAQ that captures the long tail; set it higher if you only want the genuine top questions. The grouper reads the number each pass, so changing it doesn’t need a deploy. The priority flag from a manual entry (Part 2) is the one override: a rep can mark a question important and skip the counting entirely.
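
Reading the number fresh on each pass might look like the sketch below. The post does not say what format the rules doc uses, so JSON is assumed here; `min_asks_for_candidate` is the name from the post, while `load_rules` and the `window_days` key are hypothetical.

```python
import json

# Fallbacks used when the rules doc omits a key.
DEFAULTS = {
    "min_asks_for_candidate": 5,
    "window_days": 30,  # hypothetical key name for the time window
}

def load_rules(raw_json):
    """Merge the rules doc over the defaults; an empty doc yields the defaults."""
    rules = dict(DEFAULTS)
    if raw_json:
        rules.update(json.loads(raw_json))
    return rules

# Today's pass, with an empty rules doc: the default of five applies.
rules = load_rules("")
print(4 >= rules["min_asks_for_candidate"])  # False

# Someone edits the doc; tomorrow's pass picks it up with no deploy.
rules = load_rules('{"min_asks_for_candidate": 3}')
print(4 >= rules["min_asks_for_candidate"])  # True
```

Because the function is called at the start of every pass, nothing is cached between runs, which is what makes the change take effect the next day.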

There’s also a time window. The count that matters is “asked N times in the last 30 days,” not “asked N times ever,” so a question that was hot last year but nobody asks now doesn’t keep nagging to be answered. The window length is configurable too.
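
Counting inside a window is a one-line filter over the seen dates. In this sketch the function name and the `window_days` parameter are illustrative, and the seen dates are assumed to be timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone

def asks_in_window(seen_dates, now, window_days=30):
    """Count the asks whose seen date falls within the last window_days."""
    cutoff = now - timedelta(days=window_days)
    return sum(1 for seen in seen_dates if seen >= cutoff)

now = datetime(2025, 6, 30, tzinfo=timezone.utc)
# One cluster's asks: 1, 10, 29, 45, and 400 days ago.
seen = [now - timedelta(days=d) for d in (1, 10, 29, 45, 400)]

print(asks_in_window(seen, now))                  # 3
print(asks_in_window(seen, now, window_days=60))  # 4
```

The ask from 400 days ago never counts, so last year's hot question stops nagging once people stop asking it.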

Why the grouping uses no model

The grouper could ask a model “are these two questions the same?” on every pair. It doesn’t. Two reasons. First, the embedding already captured the meaning in Part 2 — a nearest-neighbor search over those vectors does the same job for a fraction of a cent, and it’s consistent: the same two questions always land the same distance apart. Second, a model in this loop would cost money on every question on every pass, most of which just join an obvious cluster or get skipped. The model earns its place in Part 4, drafting the answer for a cluster that’s already been judged worth answering — not here, sorting questions into bins.

Next post: how the drafter takes a candidate cluster, pulls the matching passages from your help docs, and writes a short answer that cites its source — or admits when it can’t.
