Part 3 of 7 · Transcription archive series · ~5 min read

How a transcript becomes searchable

A transcript is a wall of text. Searching it for exact words misses the moment somebody said the same thing a different way — “we need it before year-end” won’t match a search for “deadline.” To fix that, the archive turns each transcript into something that can be matched by meaning. It splits the transcript into short timed chunks, turns each chunk into a vector, and writes the vectors to S3 Vectors. The chunking is plain Python. The embeddings are the only model call. After this step, the recording is searchable.

Key takeaways

  • A vector is a list of numbers that captures meaning; two ways of saying the same thing land near each other.
  • Each transcript is split into short chunks, every chunk carrying its start time in the recording.
  • Titan Text Embeddings V2 turns each chunk into a 1024-number vector — the one model call in this step.
  • Vectors land in S3 Vectors, each tagged with its recording, its timestamp, and its access tag.
  • Indexing runs once per recording. Search later is cheap because the hard work is already done.

The indexing flow, per recording

[Figure: vertical flow diagram of indexing, per recording. Input: the filed transcript (title, date, people, access tag, and the full text with per-word timestamps). Step 1: split into timed chunks of roughly a paragraph each, every chunk keeping the start time of its first word. Step 2: drop chunks that are too small or silent (long pauses, hold music). Step 3: embed each kept chunk with Titan Text Embeddings V2, which returns a 1024-number vector. Step 4: attach metadata (recording id, start time, people, topic, access tag). Step 5: once all chunks are written, flag the catalogue row in DynamoDB as searchable. Outcomes: Skip chunk (no vector written), Write vector (to S3 Vectors), Mark indexed (row now searchable), Re-embed (if the embedding model changes, re-index from the stored transcript without re-transcribing).]
Fig 3. The indexing flow, per recording. Split into timed chunks, drop the empty ones, embed each with Titan V2, attach metadata, and write to S3 Vectors. Once every chunk is written, the catalogue row is flagged searchable.

Chunking: short, timed, and a little overlapping

A whole transcript is too big to embed as one vector — the meaning of an hour-long call doesn’t fit in a single list of numbers. So the transcript is cut into short passages, roughly a paragraph each, on natural speaker turns and sentence boundaries. Every chunk keeps the start time of its first word, because that timestamp is what lets search later jump straight to the moment in the audio.

Chunks overlap a little — each one carries the last sentence or two of the chunk before it. That overlap means a thought that spans a chunk boundary (“...and the budget, which by the way, is firm at fifty thousand”) is still captured whole in at least one chunk, instead of being split in a way that hides it from search. The chunk size and overlap are plain settings in the rules doc; the defaults work for normal meetings.
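
The overlap rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the archive's actual chunker: it assumes the per-word timestamps have already been grouped into sentences, it ignores speaker turns, and the names `Sentence`, `Chunk`, `chunk_sentences`, `max_words`, and `overlap_sentences` are all hypothetical stand-ins for whatever the rules doc calls those settings.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    start: float  # seconds from the start of the recording
    text: str


@dataclass
class Chunk:
    start: float  # start time of the chunk's first sentence
    text: str


def chunk_sentences(sentences, max_words=120, overlap_sentences=1):
    """Group timed sentences into chunks, repeating each chunk's tail in the next."""
    chunks = []
    current = []  # sentences in the chunk being built
    fresh = 0     # how many of them are new, i.e. not carried-over overlap

    def emit():
        chunks.append(Chunk(start=current[0].start,
                            text=" ".join(s.text for s in current)))

    for sentence in sentences:
        current.append(sentence)
        fresh += 1
        if sum(len(s.text.split()) for s in current) >= max_words:
            emit()
            # Carry the last sentence(s) forward so a thought that spans
            # the boundary appears whole in at least one chunk.
            keep = min(overlap_sentences, len(current) - 1)
            current = current[-keep:] if keep > 0 else []
            fresh = 0

    if fresh:  # trailing sentences that never reached max_words
        emit()
    return chunks
```

Because the carried-over sentence keeps its own timestamp, a chunk's start time is still the start of its first word, so a search hit can link to the right second even when the match falls in the overlapped part.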

Very short chunks and silent stretches — long pauses, hold music, the thirty seconds before everyone joined — are dropped before embedding. There’s no point spending a model call to index “[silence].”
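
A filter for this can be very small. The sketch below assumes the transcriber marks non-speech with bracketed tags such as "[silence]" and that five spoken words is a sensible floor; both the tag list and the `min_words` threshold are assumptions for illustration, not values from the rules doc.

```python
import re

# Bracketed non-speech markers; the exact tags depend on the transcriber (assumed here).
NON_SPEECH = re.compile(r"\[(?:silence|music|hold music|inaudible)\]", re.IGNORECASE)


def is_worth_embedding(text, min_words=5):
    """Return True if the chunk has enough spoken words to be worth a model call."""
    spoken = NON_SPEECH.sub(" ", text)  # strip the markers, keep the real words
    return len(spoken.split()) >= min_words
```

Chunks that fail the check are skipped before the embedding call, so they never cost anything and never show up in search.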

Embedding: turning words into numbers that mean something

Each kept chunk is sent to Titan Text Embeddings V2, which returns a vector — a list of 1024 numbers. The trick of embeddings is that the numbers capture meaning: two chunks that say the same thing in different words land close together in that 1024-dimensional space, and two chunks about different things land far apart. That’s what lets a search for “deadline” find a chunk that only ever said “before year-end.” The same model will turn the search question into a vector later, so questions and chunks are measured on the same ruler.
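
As a rough sketch of what that call and the "close together" comparison look like: the function below assumes a Bedrock runtime client created elsewhere with `boto3.client("bedrock-runtime")` and passed in, and the request fields (`inputText`, `dimensions`, `normalize`) follow the Titan Text Embeddings V2 request format as I understand it. The helper names `embed_chunk` and `cosine_similarity` are hypothetical, and cosine similarity is one common way to measure closeness, not necessarily the metric the archive's index uses.

```python
import json
import math

MODEL_ID = "amazon.titan-embed-text-v2:0"  # Titan Text Embeddings V2 on Bedrock


def embed_chunk(client, text, dimensions=1024):
    """Send one chunk to the embedding model and return its vector as a list of floats."""
    body = json.dumps({"inputText": text, "dimensions": dimensions, "normalize": True})
    response = client.invoke_model(modelId=MODEL_ID, body=body)
    return json.loads(response["body"].read())["embedding"]


def cosine_similarity(a, b):
    """Closeness of two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Passing the client in, rather than creating it inside the function, keeps the sketch testable without AWS credentials. The same `embed_chunk` would be reused for the search question, which is what puts questions and chunks on the same ruler.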

Embedding is the one model call in this step, and it’s a cheap one: a fraction of a cent per chunk. A typical hour-long meeting is a few dozen chunks, so indexing a recording costs a cent or two. And it normally happens once per recording. After that, the recording can be searched as many times as you like for almost nothing.

Writing to S3 Vectors

The vectors go into S3 Vectors, AWS’s built-in vector store. Each vector is written with metadata attached: the recording id, the chunk’s start time, the people, the topic, and — importantly — the access tag. That access tag is what lets search filter results by who’s allowed to see them before any answer is written, which Part 5 covers in detail. Storing the access tag on the vector means the filter happens at the search itself, not as an afterthought.
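
To make the shape of a stored vector concrete, here is a hedged sketch of building one record and writing records in batches. It assumes an S3 Vectors client created elsewhere with `boto3.client("s3vectors")` and passed in, and the `put_vectors` call shape reflects my understanding of that client. The key format, the metadata field names, the bucket and index names, and the batch size of 100 are all illustrative assumptions.

```python
def build_vector_record(recording_id, chunk_index, start_seconds,
                        embedding, people, topic, access_tag):
    """Bundle one chunk's vector with the metadata search will filter and link on."""
    return {
        "key": f"{recording_id}#{chunk_index:05d}",  # hypothetical key scheme
        "data": {"float32": embedding},
        "metadata": {
            "recording_id": recording_id,
            "start_seconds": start_seconds,  # lets a hit jump to the moment in the audio
            "people": people,
            "topic": topic,
            "access_tag": access_tag,        # what search filters on
        },
    }


def write_vectors(client, bucket, index, records, batch_size=100):
    """Write records in batches; returns how many were sent."""
    for i in range(0, len(records), batch_size):
        client.put_vectors(
            vectorBucketName=bucket,
            indexName=index,
            vectors=records[i:i + batch_size],
        )
    return len(records)
```

A call might look like `write_vectors(client, "archive-vectors", "transcripts", records)`, with both names invented for the example. Keeping the access tag in the metadata is what allows a query to carry a filter on it.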

S3 Vectors is the right fit here because the archive’s search volume is bursty and low: a handful of searches a day, not thousands a second. You pay for the storage and the searches you actually run, with no always-on index server humming in the background. For an SMB archive that mostly sits quiet and occasionally gets a question, that pricing model is exactly right.

Re-indexing without re-transcribing

The full transcript is kept in S3 even after indexing. That matters for one reason: if the embedding model is upgraded down the line, or you change the chunk size, you can re-index the whole archive straight from the stored transcripts — no need to re-transcribe, which is the expensive part. Transcription happens once per recording, ever; indexing can be redone cheaply whenever it’s worth it.
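
One way to decide when a re-index is "worth it" is to record the settings each recording was indexed with and compare them to the current ones. The sketch below assumes the catalogue row stores those settings; the `IndexSettings` fields and the `needs_reindex` helper are hypothetical, not part of the archive as described.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexSettings:
    model_id: str           # embedding model and version used
    max_words: int          # chunk size setting
    overlap_sentences: int  # overlap setting


def needs_reindex(indexed_with: Optional[IndexSettings], current: IndexSettings) -> bool:
    """True if the recording was never indexed or was indexed with different settings."""
    return indexed_with is None or indexed_with != current
```

A recording that needs re-indexing goes back through chunking, embedding, and writing from its stored transcript; the audio is never touched again.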

Next post: how a plain-language question turns into a vector, matches the closest chunks, gets filtered by access, and comes back as a short answer with a direct quote linked to the exact second.
