How repeat questions get grouped
Once a day, an EventBridge Scheduler schedule invokes the grouper Lambda. It reads the questions added since the last pass and takes them one at a time: for each, it finds the questions already in the pile that mean the same thing, then either adds the new question to an existing group or starts a new one. Then it decides which groups have been asked often enough to deserve a FAQ entry. The whole pass is plain Python over the vectors. No model writes anything here; the only AI involved is the embedding from Part 2.
Key takeaways
- The grouper runs once a day via EventBridge Scheduler.
- Each new question is matched against existing clusters by vector nearness — same meaning, not same words.
- A near-match joins a cluster; nothing close enough starts a new one.
- A cluster that crosses the repeat threshold (default: five asks) and isn’t already covered becomes a candidate.
- The grouper writes no answer — it only decides what is worth answering.
The grouping flow, per question
Same meaning, not the same words
The whole reason questions are embedded in Part 2 is so the grouper can match them by meaning. “Do you ship to Canada?”, “Can you deliver to Toronto?”, and “Is international shipping available north of the border?” share almost no words, but they’re the same question. Their vectors sit close together, so a nearest-neighbor search in S3 Vectors finds them as a group. A keyword match would put them in three different buckets and you’d never see that it’s one popular question asked three ways.
The grouper queries S3 Vectors for the closest existing questions to each new one. If the closest is within the join threshold (a distance the rules doc sets, with a sensible default), the new question joins that cluster and bumps its times-seen counter. If nothing is close enough, the question starts a new single-question cluster of its own. Over days, popular questions accrete into big clusters and rare one-offs stay as singletons.
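The join-or-start decision can be sketched in a few lines of plain Python. This is a minimal illustration, not the grouper's actual code: it swaps the S3 Vectors query for a brute-force scan over in-memory lists, assumes cosine distance as the metric, and uses a made-up default of 0.25 for the join threshold. The names `Cluster`, `assign`, and `JOIN_THRESHOLD` are hypothetical.

```python
import math
from dataclasses import dataclass, field

# Hypothetical default; the real value comes from the rules doc.
JOIN_THRESHOLD = 0.25


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / norm


@dataclass
class Cluster:
    cluster_id: int
    vectors: list = field(default_factory=list)
    times_seen: int = 0


def assign(vector, clusters, join_threshold=JOIN_THRESHOLD):
    """Join the nearest cluster if it is close enough, else start a new one."""
    best, best_dist = None, float("inf")
    # Brute-force nearest neighbor; stands in for the vector index query.
    for cluster in clusters:
        for member in cluster.vectors:
            d = cosine_distance(vector, member)
            if d < best_dist:
                best, best_dist = cluster, d
    if best is None or best_dist > join_threshold:
        # Nothing close enough: start a single-question cluster.
        best = Cluster(cluster_id=len(clusters))
        clusters.append(best)
    best.vectors.append(vector)
    best.times_seen += 1
    return best
```

Calling `assign` once per new question, in order, reproduces the behavior described above: near-matches bump an existing cluster's `times_seen`, and everything else starts as a singleton.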
Four outcomes, every pass
Every new question, every pass, lands in exactly one of four buckets. The names are plain on purpose.
- Skip. The question is already covered by a live FAQ entry. Count it — knowing a covered question is still being asked is useful — but don’t make a new candidate. Most questions, once the FAQ matures, land here.
- Wait. The question joined a cluster, but the cluster hasn’t been asked enough times yet. Keep it warm. A cluster sitting at three asks this month is one good week away from becoming a candidate; the grouper remembers it so nothing has to start over.
- Candidate. The cluster just crossed the repeat threshold and isn’t covered. Mark it ready for the drafter in Part 4. This is the whole point of the pass — turning “people keep asking this” into “let’s answer it once, well.”
- Refresh. Like Skip, the cluster is covered by a live entry, but the asks keep coming instead of tapering off. That often means the published answer is unclear, incomplete, or out of date. Flag the entry for a refresh so a reviewer can decide whether the answer needs a rewrite.
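The four outcomes can be written as one small decision function. The sketch below is one possible reading, not the grouper's actual logic. In particular, the post does not say how Skip and Refresh are told apart, so the `refresh_after_asks` cutoff and the `asks_since_publish` counter are assumptions made for illustration, as are the names `Outcome`, `ClusterState`, and `classify`. It also assumes a cluster "crosses" the threshold when its count reaches it.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SKIP = "skip"
    WAIT = "wait"
    CANDIDATE = "candidate"
    REFRESH = "refresh"


@dataclass
class ClusterState:
    asks_in_window: int          # asks counted inside the time window
    covered: bool                # a live FAQ entry already answers this cluster
    asks_since_publish: int = 0  # hypothetical: asks since the entry went live


def classify(cluster, min_asks_for_candidate=5, refresh_after_asks=10):
    """Return exactly one outcome for a cluster.

    refresh_after_asks is an assumed cutoff, not something the post defines.
    """
    if cluster.covered:
        # Covered clusters never become candidates; they are skipped or refreshed.
        if cluster.asks_since_publish >= refresh_after_asks:
            return Outcome.REFRESH
        return Outcome.SKIP
    if cluster.asks_in_window >= min_asks_for_candidate:
        return Outcome.CANDIDATE
    return Outcome.WAIT
```

Checking coverage first keeps the buckets mutually exclusive: a covered cluster can only be Skip or Refresh, and an uncovered one can only be Wait or Candidate.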
The threshold is a number you own
How many asks make a FAQ entry? That’s a judgment call, and it lives in the rules doc as a plain number — min_asks_for_candidate, default five. Set it lower if you want a thorough FAQ that captures the long tail; set it higher if you only want the genuine top questions. The grouper reads the number each pass, so changing it doesn’t need a deploy. The priority flag from a manual entry (Part 2) is the one override: a rep can mark a question important and skip the counting entirely.
There’s also a time window. The count that matters is “asked N times in the last 30 days,” not “asked N times ever,” so a question that was hot last year but nobody asks now doesn’t keep nagging to be answered. The window length is configurable too.
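As a rough sketch of how the threshold, the window, and the priority override might fit together: `min_asks_for_candidate` is the name the rules doc uses, while the `window_days` key, the dict-shaped rules object, and both function names are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def asks_in_window(ask_times, now, window_days=30):
    """Count only the asks that fall inside the trailing window."""
    cutoff = now - timedelta(days=window_days)
    return sum(1 for t in ask_times if t >= cutoff)


def is_candidate(ask_times, now, rules, priority=False):
    """Decide whether a cluster has been asked often enough, recently enough.

    rules is assumed to be a dict loaded from the rules doc on each pass.
    """
    if priority:
        # A rep marked the question important: skip the counting entirely.
        return True
    min_asks = rules.get("min_asks_for_candidate", 5)
    window_days = rules.get("window_days", 30)  # hypothetical key name
    return asks_in_window(ask_times, now, window_days) >= min_asks


now = datetime(2025, 6, 30, tzinfo=timezone.utc)
recent = [now - timedelta(days=d) for d in (1, 3, 8, 15, 29)]
stale = [now - timedelta(days=d) for d in (200, 210, 220, 230, 240, 250)]

print(is_candidate(recent, now, {}))                # five asks in 30 days
print(is_candidate(stale, now, {}))                 # six asks, all outside the window
print(is_candidate(stale, now, {}, priority=True))  # override skips the count
```

Because the rules are passed in on every call, changing the number in the rules doc changes the next pass without touching the code.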
Why the grouping uses no model
The grouper could ask a model “are these two questions the same?” on every pair. It doesn’t. Two reasons. First, the embedding already captured the meaning in Part 2 — a nearest-neighbor search over those vectors does the same job for a fraction of a cent, and it’s consistent: the same two questions always land the same distance apart. Second, a model in this loop would cost money on every question on every pass, most of which just join an obvious cluster or get skipped. The model earns its place in Part 4, drafting the answer for a cluster that’s already been judged worth answering — not here, sorting questions into bins.
Next post: how the drafter takes a candidate cluster, pulls the matching passages from your help docs, and writes a short answer that cites its source — or admits when it can’t.