Part 2 of 7 · Website chat assistant series · ~4 min read

How a conversation starts and stays alive

A chat widget feels live, but underneath it’s a careful little dance — the page loads a tiny script, the visitor clicks, a websocket opens, a scratchpad gets minted, and somewhere in between the cloud has to decide how much it’s willing to remember. Here’s the lifecycle of a single conversation, end to end.

Three moments in a conversation

[Figure: a horizontal three-stage flow — connect, exchange, idle.]
  • Stage 1, Connect: the visitor clicks the widget bubble; the page mints a short-lived session token and opens a websocket to the cloud; the cloud creates a session row in the conversation store; the widget shows a small greeting.
  • Stage 2, Exchange: the visitor types a message; it streams over the websocket; the cloud appends it to the session scratchpad, hands it to the answerer (next post), and streams the reply back word by word; the scratchpad is trimmed to the last few turns so it never grows unbounded.
  • Stage 3, Idle: no message for a few minutes; the cloud closes the websocket gently and marks the session ended; the scratchpad expires after a short window so nothing visitor-typed lingers.
Note: the session scratchpad is short-term memory only; the system does not remember the visitor next week, and there is no long-lived profile unless the visitor explicitly signs in.
Fig 2. Connect, exchange, idle. A websocket and a small scratchpad — that’s the whole shape.

Why a websocket and not a plain request

A chat widget is the one place on a website where a visitor expects words to appear in real time, the way a human typing would. Plain request/response works for a single question, but it falls apart on the second turn — the visitor sees a frozen UI while the cloud thinks, and follow-up questions feel laggy. A websocket makes the connection feel alive: messages stream back word by word as they’re generated, and the visitor can type again without waiting for a full reply to arrive.

It also keeps the per-message overhead low. Once the connection is open, each turn is a few bytes over the wire instead of a fresh TLS handshake plus authentication. That’s the difference between a 200ms chat and a 1.5s chat — a difference visitors notice without being able to name.
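On the server side, streaming word by word is just an incremental producer: each chunk goes onto the open socket as soon as it exists, rather than waiting for the full reply. A small sketch of the shape, with an async generator standing in for the model and the list standing in for the socket writes (both are assumptions for illustration):

```python
import asyncio

async def stream_reply(reply: str, delay: float = 0.0):
    """Yield a reply word by word, the way the cloud streams tokens
    over the open websocket instead of sending one big response."""
    for word in reply.split():
        await asyncio.sleep(delay)  # stand-in for per-token generation latency
        yield word

async def demo() -> list[str]:
    chunks = []
    async for word in stream_reply("Yes, we ship to Canada."):
        chunks.append(word)  # in the real widget, each chunk is appended to the bubble
    return chunks
```

The visitor sees the first word as soon as it is generated; with plain request/response they would see nothing until the last word was done.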

The session scratchpad

Every conversation has a small scratchpad — a short list of recent turns that the answerer reads before each reply. It exists so a visitor doesn’t have to re-establish context on every message: if turn one was “do you ship to Canada?” and turn two is “how long does it take?”, the scratchpad lets the answerer connect “it” to “shipping to Canada.”

The rules for the scratchpad are deliberately strict:

  • Short. Only the last few turns. Long histories make replies slower, more expensive, and weirdly off-topic as the AI reads things from earlier in the conversation that aren’t relevant anymore.
  • Trimmed automatically. Once the scratchpad reaches its limit, the oldest turn falls off. No manual cleanup, no growth without bound.
  • Session-scoped. The scratchpad belongs to one websocket session. When the session ends, it’s gone. The next time the visitor opens the widget, it’s a fresh start.
  • Expires quickly. Even before the session ends, idle scratchpads time out. Nothing the visitor typed lingers in storage longer than the conversation lasted.
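All four rules fit in a few lines of code. A minimal sketch of a session scratchpad under these constraints — the turn limit and TTL values are illustrative, not the series' actual numbers:

```python
import time
from collections import deque

class Scratchpad:
    """Session-scoped short-term memory: the last N turns, with an idle expiry."""

    def __init__(self, max_turns: int = 6, ttl_seconds: float = 300.0):
        self.turns = deque(maxlen=max_turns)  # oldest turn falls off automatically
        self.ttl = ttl_seconds
        self.last_used = time.monotonic()

    def append(self, role: str, text: str) -> None:
        """Record one turn and refresh the idle clock."""
        self.turns.append((role, text))
        self.last_used = time.monotonic()

    def recent(self) -> list[tuple[str, str]]:
        """What the answerer reads before each reply."""
        return list(self.turns)

    def expired(self) -> bool:
        """True once the conversation has been idle past the TTL."""
        return time.monotonic() - self.last_used > self.ttl
```

The `deque(maxlen=...)` does the trimming: once the limit is reached, appending a new turn silently drops the oldest one, so there is no cleanup code to forget.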

The reason for all this restraint is that “memory” in a chat assistant is the source of half its problems. Long-lived memory invites privacy questions (what did the assistant store about that visitor?), surprise behaviours (the assistant brings up something from three weeks ago), and expensive prompts (every turn pays for a long history). Short-term memory, scoped to a session, sidesteps all of that. If you want long-lived memory — for a logged-in customer who wants their order history available — that’s a separate, opt-in feature, layered on top.

What happens at the edges

Three edge cases are worth designing for from the start, because they are common and fail silently when mishandled:

  • The visitor refreshes the page. The websocket drops; the widget reopens it on the new page; the new connection gets a fresh scratchpad. Treating “same visitor” across page loads adds complexity that almost no SMB needs.
  • The visitor opens two tabs. Each tab gets its own session. They don’t share a scratchpad. This is the simplest behaviour and the one visitors expect — if they want to compare two threads, they expect them to be independent.
  • The connection drops mid-reply. The cloud finishes generating the reply, stores it on the (now-ended) session, and on next reconnect the widget shows it as the last turn. The visitor sees their answer the moment they come back online.

How this plugs into the next post

Everything in this post is plumbing — how a turn arrives at the answerer with the right context attached. The next post is about what the answerer actually does: how it searches your knowledge, how it requires a citation, and how it picks one of four moves on every visitor turn. The session and scratchpad are what let it focus on the current message without losing the thread.
