Part 7 of 7 · Transcription archive series ~8 min read

Engineering reference: the transcription archive architecture

Same system, drawn for engineers. Region, service names, resource identifiers, the Amazon Transcribe job config, Bedrock model IDs, the S3 Vectors index layout, Lambda inventory, IAM scopes, the SES inbound rule set, and the DynamoDB schemas. Read alongside the previous six posts; this one’s the build sheet.

Region and account shape

Default region: ap-southeast-1 (Singapore). Amazon Transcribe, S3 Vectors, Bedrock cross-Region inference, and SES inbound are all available there. A second region for multi-region resilience isn’t worth the extra setup work at SMB volume — the failure mode for an SMB is a search that returns nothing for an hour, not a regional outage. One AWS account dedicated to the archive (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system. Data residency matters here: recordings can be sensitive, so pick the region that satisfies your contracts and keep audio, transcripts, and vectors all in it.

Topology

AWS topology of the transcription archive A topology diagram with three regions stacked vertically inside one AWS account boundary. Top region: ingress. Three boxes show the three intake lanes — a Drive folder sync via the drive-sync Lambda triggered every few minutes by EventBridge Scheduler that copies new recordings to s3://tx-audio/, an SES inbound rule set with action S3 PUT to s3://tx-raw-mime/ plus the parser Lambda intake-ses-parser that extracts the audio attachment to s3://tx-audio/, and a connector Lambda triggered on a schedule by EventBridge Scheduler that pulls finished recordings from the meeting tool's cloud API into s3://tx-audio/. Middle region: pipeline processing. The S3 PUT on tx-audio triggers the transcribe Lambda, which starts an Amazon Transcribe job writing JSON to s3://tx-transcripts/; on job completion an EventBridge event triggers the filer Lambda, which writes a catalogue row to DynamoDB tx-catalogue, then the indexer Lambda chunks the transcript, calls Titan Text Embeddings V2 for each chunk, and writes vectors with metadata to the S3 Vectors index tx-vectors. Bottom region: search and access. The search-handler Lambda behind a Function URL embeds the query with Titan V2, queries the S3 Vectors index, filters returned chunks by the asker's access against each chunk's access tag, calls Claude Haiku 4.5 to compose a grounded answer with a quote and timestamp, and writes a row to DynamoDB tx-searchlog. The ack-handler equivalent here is the access-handler Lambda that records named opens of locked recordings to tx-access. CloudWatch Logs collects from every Lambda at 7-day retention. Across the right edge: a small box labelled AWS Budgets alarm at $25 monthly threshold, posting to SNS topic tx-cost-alarm. A note at the bottom: every answer is grounded in retrieved chunks — and every search is logged to tx-searchlog. Ingress Lambda · drive-sync every few min Drive API → s3://tx-audio/ new recordings SES inbound rule set tx-inbound-rules action: S3 PUT s3://tx-raw-mime/ trigger: intake-ses-parser Lambda · connector scheduled pull meeting tool API for finished recordings → s3://tx-audio/ S3 audio bucket tx-audio · PUT starts pipeline Pipeline processing Lambda · transcribe StartTranscriptionJob speaker labels on JSON → tx-transcripts + word timestamps Lambda · filer date, people, topic access tag from rules writes catalogue row to tx-catalogue Lambda · indexer chunk + Titan V2 1024-dim vectors → S3 Vectors tx-vectors (access tag on each) Search & access Lambda · search-handler Function URL, embeds query, queries tx-vectors, filters by access Bedrock · Haiku 4.5 grounded answer + direct quote + recording, time → play link Lambda · access-handler logs every search to tx-searchlog, and named opens of locked recordings to tx-access Every answer is grounded in retrieved chunks — and every search is logged.
Fig 7. AWS topology, in three regions of the diagram: ingress (three lanes into the audio bucket), pipeline processing (transcribe, file, index), search and access (the query resolves to a grounded answer and the search is logged). Every Lambda is event- or schedule-driven; nothing is synchronous-chained.

Lambda functions

All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.

  • drive-sync — EventBridge Scheduler target, fires every few minutes. Uses the Google Drive API (service-account credentials in Secrets Manager under tx/drive/sa) to list the watched folder and copy new audio/video objects to s3://tx-audio/, recording a synced-files marker so it never re-copies. The same pattern syncs the rules and access docs to s3://tx-rules-source/. Memory: 256 MB. Timeout: 60 s.
  • intake-ses-parser — S3 PUT trigger on s3://tx-raw-mime/. Parses MIME, locates the audio/video attachment (or a download link), pulls the media, and stores it in s3://tx-audio/. Large attachments are streamed, not buffered. Keeps the raw MIME for audit. Memory: 512 MB. Timeout: 120 s.
  • connector — EventBridge Scheduler target, every two hours. Calls the meeting tool’s cloud API (OAuth token in Secrets Manager under tx/meeting/oauth) for recordings completed since the last cursor, downloads new ones to s3://tx-audio/, and advances the cursor in Parameter Store. Handles the tool’s pagination and rate limits; backs off on 429. Memory: 512 MB. Timeout: 300 s.
  • transcribe — S3 PUT trigger on s3://tx-audio/. Calls StartTranscriptionJob with ShowSpeakerLabels=true, automatic language identification (or a fixed language from config), and output to s3://tx-transcripts/<recording-id>.json. Uses the batch tier for connector-sourced jobs (no latency pressure) and the standard tier for forwarded ones. Memory: 256 MB. Timeout: 30 s (the job itself runs async in Transcribe). No Bedrock calls.
  • filer — triggered by the Transcribe job-completion event on EventBridge. Reads the transcript JSON, derives the recording date from object metadata, maps speaker labels and invitee lists to people aliases from the rules doc, tags a topic via a keyword pass, and resolves the access tag from the rules defaults. Writes one row to tx-catalogue and emits tx.filed. Memory: 256 MB. Timeout: 60 s. No Bedrock calls.
  • indexer — EventBridge rule on tx.filed. Chunks the transcript (~1 paragraph, sentence-aligned, small overlap, each chunk carrying its first-word start time), drops empty/silent chunks, calls Titan Text Embeddings V2 (amazon.titan-embed-text-v2:0) per chunk for a 1024-dim vector, and writes vectors with metadata (recording_id, start_ms, people, topic, access_tag, sensitive) to the S3 Vectors index tx-vectors. Flags the catalogue row searchable when all chunks land. Memory: 512 MB. Timeout: 120 s.
  • search-handler — Lambda Function URL, AuthType: AWS_IAM fronted by your identity provider, or a signed session for the internal search UI. Embeds the query with Titan V2, queries tx-vectors (top-k with a metadata filter on sensitive=false), drops chunks whose access_tag the caller’s teams don’t cover, then calls Claude Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0 via global.anthropic.claude-haiku-4-5-20251001-v1:0) with the surviving chunks and a strict grounding prompt. Returns answer, quote, recording, and a deep link built from start_ms. Writes a tx-searchlog row. Memory: 512 MB. Timeout: 30 s.
  • access-handler — Lambda Function URL for named opens of locked (sensitive) recordings. Verifies the caller is authorized for that specific recording_id, returns the transcript or a signed audio URL, and writes a tx-access row. This is the only path that can surface a sensitive recording, and it is always logged. Memory: 256 MB. Timeout: 15 s.
  • digest — EventBridge Scheduler target, weekly Monday 9am. Reads tx-searchlog and tx-catalogue for the week; emails an admin summary via SES (new recordings filed, top searches, empty-result questions worth investigating). No Bedrock; plain summary table. Memory: 256 MB.

Storage

  • S3 · tx-audio — source recordings. Versioning enabled. Lifecycle to Glacier Instant Retrieval at 60 days; no auto-expiry by default (recordings are the record). SSE-KMS with a dedicated key.
  • S3 · tx-transcripts — Transcribe JSON output, kept so the archive can be re-indexed without re-transcribing. Versioning enabled. SSE-KMS.
  • S3 · tx-raw-mime — raw inbound MIME from forwarded recordings, for provenance. Lifecycle to Glacier at 30 days; expiry at 7 years.
  • S3 · tx-rules-source — mirrored rules and access docs as plain text. Versioning enabled.
  • S3 Vectors · tx-vectors — the searchable index. 1024-dim vectors from Titan V2, one per kept chunk. Metadata per vector: recording_id, start_ms, people, topic, access_tag, sensitive. Queried top-k with a metadata pre-filter.
  • DynamoDB · tx-catalogue — one row per recording. PK recording_id; attributes: title, date, people, topic, access_tag, sensitive, transcript_key, audio_key, indexed. On-demand. GSI on date for browse.
  • DynamoDB · tx-searchlog — one row per query. PK (user_id, ts); attributes: query, returned_ids, result_count, latency_ms. On-demand. No TTL — this is the long-term audit trail.
  • DynamoDB · tx-access — one row per named open of a locked recording. PK (recording_id, ts); attributes: user_id, reason, granted_by. On-demand. No TTL.

Amazon Transcribe

  • Job config. StartTranscriptionJob with ShowSpeakerLabels=true and MaxSpeakerLabels tuned to room size; IdentifyLanguage=true unless a fixed language is set in config. Output to s3://tx-transcripts/. Custom vocabulary (product names, people, acronyms) raises accuracy on domain terms.
  • Tiering. Connector-sourced jobs use the batch path (no latency pressure); forwarded recordings use standard. PII redaction can be enabled per access tag so transcripts of sensitive recordings store redacted text by default.
  • Completion. Transcribe emits a job-state-change event to EventBridge; the filer Lambda triggers on COMPLETED and on FAILED writes the recording to a dead-letter prefix and alerts.

Bedrock

  • Embeddings. amazon.titan-embed-text-v2:0, 1024-dim, normalized. Two callsites: the indexer (one call per chunk at index time) and the search-handler (one call per query). The query and the chunks must use the same model and dimension.
  • Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. One callsite: the search-handler, composing the grounded answer. Sonnet 4.6 is not used — the answer is a short, grounded summary of a few chunks, well within Haiku’s range, and the cost difference matters at search volume.
  • Quotas. Default account quotas are more than enough at SMB volume. The expensive work is Transcribe, not Bedrock.

EventBridge and Scheduler config

  • tx-drive-syncrate(5 minutes). Target: drive-sync Lambda.
  • tx-connectorrate(2 hours). Target: connector Lambda.
  • tx-weekly-digestcron(0 9 ? * MON *) in TZ. Target: digest Lambda.
  • Transcribe completion rule — EventBridge rule on aws.transcribe Job State Change → target filer Lambda.
  • tx.filed rule — custom-bus rule on the filer’s emitted event → target indexer Lambda.

SES inbound and outbound

  • Set the MX record on a dedicated subdomain (e.g. archive.your-company.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
  • SES inbound rule set tx-inbound-rules: one rule with recipient archive@your-company.com → spam scan → S3 PUT to s3://tx-raw-mime/<message-id> → stop. The S3 PUT triggers intake-ses-parser.
  • SES outbound for the weekly digest: verify a sender identity at archive-bot@your-company.com with DKIM and SPF on the parent domain. Out of sandbox by request.

IAM (least privilege per Lambda)

Each Lambda has its own role with policies scoped to exact ARNs. Sketch:

  • transcribe role: s3:GetObject on tx-audio; transcribe:StartTranscriptionJob; s3:PutObject on tx-transcripts; kms:Decrypt + GenerateDataKey on the archive key. No bedrock:*.
  • filer role: s3:GetObject on tx-transcripts and tx-rules-source; dynamodb:PutItem on tx-catalogue; events:PutEvents on the custom bus. No bedrock:*.
  • indexer role: s3:GetObject on tx-transcripts; bedrock:InvokeModel on the Titan ARN; s3vectors:PutVectors on tx-vectors; dynamodb:UpdateItem on tx-catalogue (the indexed flag).
  • search-handler role: bedrock:InvokeModel on the Titan ARN and the Haiku ARN; s3vectors:QueryVectors on tx-vectors; dynamodb:PutItem on tx-searchlog; dynamodb:GetItem on tx-catalogue. No write access to audio, transcripts, or vectors.
  • access-handler role: dynamodb:PutItem on tx-access; dynamodb:GetItem on tx-catalogue; s3:GetObject + presign on tx-audio and tx-transcripts scoped per-request to the opened recording_id.
  • drive-sync / connector / intake-ses-parser roles: secretsmanager:GetSecretValue on the relevant secret; s3:PutObject on tx-audio (and tx-rules-source for drive-sync); outbound network to the Google or meeting-tool API only.

Search surface

The search box is a small static page that posts the query and the caller’s identity token to the search-handler Function URL. Identity comes from your existing SSO (the Function URL is fronted by IAM auth or a short-lived signed session); the archive doesn’t run its own user store. The response is rendered as a short answer, the quote in a blockquote, the recording title and date, and a play button whose link carries the start_ms so the audio element seeks straight to the moment. Locked recordings never appear here; opening one is a separate, authorized action through access-handler.

Observability and cost gates

  • CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Subscription filter on "error" + "throttle" + "timeout" to a CloudWatch metric for alerting.
  • Alarms: Transcribe job failures > 0 in a day; indexer failures > 0 (an un-indexed recording is invisible to search); search-handler p95 latency > 4s; access-handler authorization failures > 5/hour.
  • X-Ray: on for the search-handler only (the user-facing path); off elsewhere to save cost.
  • AWS Budgets: $25/month threshold, alarm at 80% and 100%, posts to SNS topic tx-cost-alarm subscribed to the on-call admin’s email.

Config and secrets

Service-account credentials for the Drive API live in Secrets Manager under tx/drive/sa; the meeting-tool OAuth token under tx/meeting/oauth. The connector cursor, the chunk-size and overlap settings, the topic and people-alias tables, the access defaults, and the SES sender identity all live in Parameter Store under /tx/config/ (the larger tables as JSON in tx-rules-source, mirrored from Drive). The KMS key id for the archive and the index region are also config. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.

Deploy

GitHub Actions with OIDC into a deploy role — no long-lived keys — running AWS SAM. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for tx-audio, tx-transcripts, and tx-rules-source so a bad sync or edit can be rolled back, and keep the KMS key and the S3 Vectors index in the same region as the audio for data-residency reasons. Total deployable surface: around nine Lambdas, three DDB tables, one S3 Vectors index, four S3 buckets, a handful of EventBridge rules, one SES rule set, and one Budgets alarm.

That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.

All posts