Part 5 of 7 · AWS autoposting series · ~5 min read

How replies work without making things up

The bot answers from the client’s docs only — or escalates to a human. Citation required, no exceptions.

AI reply bots have a bad reputation, and they’ve earned it. They quote prices that don’t exist, promise features that aren’t real, give medical or financial advice they shouldn’t. The fix isn’t to make the model bigger — it’s to take away its freedom to make things up.

[Figure: the RAG reply pipeline. A Facebook user message hits the Meta webhook → a Lambda Function URL verifies the HMAC signature and returns 200 within milliseconds → the message is enqueued to SQS (with a DLQ). The Reply Lambda embeds the question (Bedrock Titan) and searches the S3 Vectors index for the top 3–5 most relevant chunks. A confidence gate checks the top similarity score: low → escalate to a human via SNS and log as unanswered; high → Claude Haiku 4.5 generates a grounded answer using only the retrieved chunks, refusing if the question isn't covered. A citation guardrail blocks any reply that cites no chunk — replies that pass go out via the Graph API back to the Facebook page.]
Fig 5. Two guardrails: a confidence gate before the AI runs, a citation check before the reply ships.

Why webhook then queue

Facebook’s webhook expects a quick “got it” within a couple of seconds, or it’ll keep retrying the same message. So the receiver does the bare minimum — checks the message is real, drops it into a queue, and answers Facebook right away. The slow work (looking things up, calling the AI) happens after, off the critical path. If the AI happens to be slow that day, Facebook still gets its “got it” on time. And if a message fails partway through, it lands in a separate “something went wrong” queue you can look at later, instead of vanishing.
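The receiver's whole job fits in a few lines. A minimal Python sketch, assuming Meta's documented X-Hub-Signature-256 scheme (HMAC-SHA256 of the raw request body, keyed with your app secret); APP_SECRET, enqueue, and handler are illustrative names, not the actual code:

```python
import hashlib
import hmac

# Illustrative stand-in for the Meta app secret (kept in a secret store in practice).
APP_SECRET = b"your-meta-app-secret"

def verify_signature(raw_body: bytes, signature_header: str) -> bool:
    """Check Meta's X-Hub-Signature-256 header: 'sha256=<hexdigest>'
    computed over the raw request body with the app secret."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = hmac.new(APP_SECRET, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header[len("sha256="):])

def handler(raw_body: bytes, signature_header: str, enqueue) -> dict:
    """Do the bare minimum: verify, enqueue, ack. The slow work happens
    in the Reply Lambda that drains the queue."""
    if not verify_signature(raw_body, signature_header):
        return {"statusCode": 403}
    enqueue(raw_body.decode())   # e.g. an SQS send; hypothetical callable here
    return {"statusCode": 200}   # Facebook gets its "got it" fast
```

The AI never appears in this function — that's the point. Everything slow lives behind the queue.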

Why a confidence gate

The vector search returns the top relevant chunks ranked by similarity. If the top match is weak — below 0.6, say — that’s the system telling you “I don’t actually know this.” Forcing the model to answer anyway is exactly how hallucinations happen.

So the confidence gate is brutal: low score, the message lands in your inbox as an alert and gets logged in an “unanswered” list. That list becomes the next batch of FAQ entries the client should add to their Drive doc. The bot improves over time without anyone training a model.
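The gate itself is just a threshold check on the top hit. A hedged sketch, where hits, escalate, and answer are hypothetical stand-ins for the vector search results, the SNS alert, and the model call:

```python
def route(question, hits, escalate, answer, threshold=0.6):
    """hits: (chunk_id, similarity) pairs, best first.
    Low confidence -> escalate to a human and log; else answer."""
    if not hits or hits[0][1] < threshold:
        escalate(question)           # e.g. SNS alert + 'unanswered' log entry
        return None                  # the model never runs on weak matches
    return answer(question, hits)    # grounded answer from retrieved chunks
```

Note that the model isn't called at all on the low path — escalation happens before any generation, so there's nothing to hallucinate.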

Why a citation guardrail

Even when given the right context, models sometimes freestyle anyway. So the instructions to the model are blunt:

  • Answer using ONLY the provided context.
  • Return both the reply text and the IDs of the chunks you actually used.

The Reply Lambda checks the response. No cited chunk = no send. The reply is logged but never reaches Facebook. This is the structural guarantee that the bot can’t invent things — if the model didn’t ground its answer in actual content from the client’s docs, the reply doesn’t go out.
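The check is mechanical. A sketch, assuming the model is asked to return JSON with "reply" and "cited_chunk_ids" fields (an illustrative response format, not necessarily the exact one used):

```python
import json

def passes_citation_check(model_output: str, retrieved_ids: set):
    """Returns (ok, reply). Blocks the send unless at least one cited
    chunk ID is one we actually retrieved for this question."""
    try:
        parsed = json.loads(model_output)
        reply = parsed["reply"]
        cited = set(parsed.get("cited_chunk_ids", []))
    except (ValueError, KeyError, TypeError):
        return False, None           # malformed output never ships
    if not (cited & retrieved_ids):
        return False, None           # no grounded citation = no send
    return True, reply
```

Checking the citation against the retrieved set (not just "any citation") matters: a model that invents a plausible-looking chunk ID still gets blocked.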

Same denylist as posts

The reply pipeline reuses the page’s denylist from the posting pipeline. A reply on the SFC page can’t mention “XAUUSD,” even by accident. A reply on DailyScalper can’t talk about theology. The same per-page rules govern both directions of the conversation.
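Reuse is cheap because the check is trivial — a minimal sketch of a case-insensitive denylist filter, run on any reply before it ships (function and parameter names are illustrative):

```python
def violates_denylist(text: str, denylist: list) -> bool:
    """Case-insensitive substring check against the page's denylist —
    the same per-page list the posting pipeline uses."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in denylist)
```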

Start in draft mode

For every new client, replies start in draft mode by default: instead of going to Facebook, they go to your inbox. You review them in batches for a week or two. Once you trust the system on a category — say, pricing questions — you flip that category to auto-send. Broader categories like general FAQ come later.

The cost is a slower rollout. The benefit is that no theological or financial mishap reaches a real user before you’ve seen what the bot would have said.
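Per-category rollout can be sketched as one routing function — auto-send only for categories you've already reviewed and trusted, everything else to your inbox (dispatch, send, and draft are hypothetical names for illustration):

```python
def dispatch(reply: str, category: str, auto_send_categories: set, send, draft) -> str:
    """Route a generated reply: trusted categories go straight to
    Facebook; everything else lands in your review inbox."""
    if category in auto_send_categories:
        send(reply)      # e.g. the Graph API call
        return "sent"
    draft(reply)         # e.g. email/inbox for batch review
    return "drafted"
```

Flipping a category to auto-send is then a one-line config change, not a code change.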
