How a comment gets a verdict

Key takeaways

Only borderline items reach the model — the rule pass already settled the rest.
House rules live in the Drive doc; a rep can edit them without a deploy.
The model returns a verdict, a confidence score, and the exact rule the content may break.
Low confidence never auto-acts — it routes the item to a human instead.
The model never deletes anything. Its strongest call is “hold for review.”

The decision flow, per item

Fig 3. The checker’s decision tree, per borderline item. A trusted author short-circuits to pass; everyone else gets a model verdict with confidence and a rule. Low confidence routes to a human. The rules doc holds every rule; the model only applies them.

House rules: “no spam, no attacks” isn’t magic, it’s in the doc

The rules doc has one short section per area of your site. Each section names the rules in plain prose: “Comments: no promotional links from unknown sites, no personal attacks on other members, no off-topic reselling. Reviews: opinions are fine even if harsh, but no naming staff, no made-up claims. Posts: no recruiting, no spam.” Each rule says what it means and whether breaking it is a hold (clear) or a send-to-human (judgment call). The model is handed this doc verbatim; it doesn’t carry its own idea of what your community allows.

Rules differ by area for a reason. A harsh review is a customer being honest — you want it. The same words aimed at another member under a post are an attack — you don’t. The doc lets you say so, in your own words, and a rep can change the wording any time without a developer. Each section also sets a confidence threshold — how sure the model has to be before its call is acted on automatically rather than sent to a person.

What the model actually returns

The prompt to Bedrock Haiku 4.5 is short and strict: “Here is a comment, here are the house rules for this area, and here are a few past calls a human corrected. Return JSON only. Give a verdict of pass, hold, or send-to-human; a confidence score from 0 to 1; and the exact rule the content may break, quoted from the doc. Do not invent a rule that isn’t there. If you are unsure, return send-to-human.”

That last line matters. The model is told that “I’m not sure” is a perfectly good answer, and unsure means a person looks. There is no pressure to force a confident call. The confidence score is then checked against the per-area threshold: a confident pass publishes, a confident hold queues the item, and anything below the line goes to a human regardless of which way it leaned. The model citing the exact rule is what makes the review card in Part 4 useful — the moderator sees not just “held” but “held because of this rule.”

The calls, and why “hold + notify” exists

Most items land in one of three calls: pass (publish), hold (a clear break — keep it out of view and add it to the review queue for the next batch), or send to a human (borderline — queue it with the reason). There’s a fourth, rarer terminal for severity: hold + notify. A clear, severe break — a credible threat, a doxxing post, hate speech — is still held rather than deleted, but a moderator is paged at once instead of waiting for the next batched digest. Holding the item keeps the decision reversible; the immediate page just gets a human looking sooner. Even here, no content comes down without a person.

Why the model doesn’t read everything

The checker could send every item to the model and skip the rule pass entirely. It doesn’t, for two reasons. First, the rule pass is free and instant, and most items are clearly safe or clearly spam — spending a model call on them is waste. Second, the items a simple rule can settle are exactly the items where a model adds nothing; the model earns its place only on the genuinely ambiguous middle. Keeping the cheap check in front means the bill stays a couple of dollars a month even as the page gets busy.

Next post: how a held or flagged item actually reaches a moderator — who gets which area, how quiet hours and batching keep the pings sane, and the four guardrails on every review card.

All posts