Engineering reference: the transcription archive architecture
Same system, drawn for engineers. Region, service names, resource identifiers, the Amazon Transcribe job config, Bedrock model IDs, the S3 Vectors index layout, Lambda inventory, IAM scopes, the SES inbound rule set, and the DynamoDB schemas. Read alongside the previous six posts; this one’s the build sheet.
Region and account shape
Default region: ap-southeast-1 (Singapore). Amazon Transcribe, S3 Vectors, Bedrock cross-Region inference, and SES inbound are all available there. A second region for multi-region resilience isn’t worth the extra setup work at SMB volume — the failure mode for an SMB is a search that returns nothing for an hour, not a regional outage. One AWS account dedicated to the archive (separate from your other workloads) keeps the IAM blast radius small and lets a single AWS Budgets alarm cover the whole system. Data residency matters here: recordings can be sensitive, so pick the region that satisfies your contracts and keep audio, transcripts, and vectors all in it.
Topology
Lambda functions
All Lambdas use the arm64 architecture, the smallest memory size that meets latency targets (typically 256 MB), Python 3.14 runtime, and CloudWatch Logs at 7-day retention. Each function has its own least-privilege IAM role. None run inside a VPC.
- drive-sync: EventBridge Scheduler target, fires every few minutes. Uses the Google Drive API (service-account credentials in Secrets Manager under tx/drive/sa) to list the watched folder and copy new audio/video objects to s3://tx-audio/, recording a synced-files marker so it never re-copies. The same pattern syncs the rules and access docs to s3://tx-rules-source/. Memory: 256 MB. Timeout: 60 s.
- intake-ses-parser: S3 PUT trigger on s3://tx-raw-mime/. Parses MIME, locates the audio/video attachment (or a download link), pulls the media, and stores it in s3://tx-audio/. Large attachments are streamed, not buffered. Keeps the raw MIME for audit. Memory: 512 MB. Timeout: 120 s.
- connector: EventBridge Scheduler target, every two hours. Calls the meeting tool’s cloud API (OAuth token in Secrets Manager under tx/meeting/oauth) for recordings completed since the last cursor, downloads new ones to s3://tx-audio/, and advances the cursor in Parameter Store. Handles the tool’s pagination and rate limits; backs off on 429. Memory: 512 MB. Timeout: 300 s.
- transcribe: S3 PUT trigger on s3://tx-audio/. Calls StartTranscriptionJob with ShowSpeakerLabels=true, automatic language identification (or a fixed language from config), and output to s3://tx-transcripts/<recording-id>.json. Uses the batch tier for connector-sourced jobs (no latency pressure) and the standard tier for forwarded ones. Memory: 256 MB. Timeout: 30 s (the job itself runs async in Transcribe). No Bedrock calls.
- filer: triggered by the Transcribe job-completion event on EventBridge. Reads the transcript JSON, derives the recording date from object metadata, maps speaker labels and invitee lists to people aliases from the rules doc, tags a topic via a keyword pass, and resolves the access tag from the rules defaults. Writes one row to tx-catalogue and emits tx.filed. Memory: 256 MB. Timeout: 60 s. No Bedrock calls.
- indexer: EventBridge rule on tx.filed. Chunks the transcript (~1 paragraph, sentence-aligned, small overlap, each chunk carrying its first-word start time), drops empty/silent chunks, calls Titan Text Embeddings V2 (amazon.titan-embed-text-v2:0) per chunk for a 1024-dim vector, and writes vectors with metadata (recording_id, start_ms, people, topic, access_tag, sensitive) to the S3 Vectors index tx-vectors. Flags the catalogue row searchable when all chunks land. Memory: 512 MB. Timeout: 120 s.
- search-handler: Lambda Function URL, AuthType: AWS_IAM fronted by your identity provider, or a signed session for the internal search UI. Embeds the query with Titan V2, queries tx-vectors (top-k with a metadata filter on sensitive=false), drops chunks whose access_tag the caller’s teams don’t cover, then calls Claude Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0 via global.anthropic.claude-haiku-4-5-20251001-v1:0) with the surviving chunks and a strict grounding prompt. Returns answer, quote, recording, and a deep link built from start_ms. Writes a tx-searchlog row. Memory: 512 MB. Timeout: 30 s.
- access-handler: Lambda Function URL for named opens of locked (sensitive) recordings. Verifies the caller is authorized for that specific recording_id, returns the transcript or a signed audio URL, and writes a tx-access row. This is the only path that can surface a sensitive recording, and it is always logged. Memory: 256 MB. Timeout: 15 s.
- digest: EventBridge Scheduler target, weekly Monday 9am. Reads tx-searchlog and tx-catalogue for the week; emails an admin summary via SES (new recordings filed, top searches, empty-result questions worth investigating). No Bedrock; plain summary table. Memory: 256 MB.
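The indexer's chunking step can be sketched as follows. This is a minimal sketch, assuming the Transcribe JSON has already been reduced to a list of (sentence, start_ms) pairs; the names Chunk and chunk_sentences and the max_chars and overlap parameters are hypothetical, and the real size and overlap come from config.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_ms: int  # start time of the chunk's first word


def chunk_sentences(sentences, max_chars=600, overlap=1):
    """Group (text, start_ms) sentences into sentence-aligned chunks.

    Each new chunk repeats the last `overlap` sentences of the previous one.
    Blank (silent) sentences are dropped first, so no empty chunk is produced.
    """
    kept = [(text.strip(), ms) for text, ms in sentences if text.strip()]
    groups = []
    current = []
    for sentence in kept:
        size = sum(len(text) for text, _ in current) + len(sentence[0])
        if current and size > max_chars:
            groups.append(current)
            # Carry the tail of the closed chunk forward as overlap.
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        groups.append(current)
    return [Chunk(" ".join(t for t, _ in g), g[0][1]) for g in groups]
```

Because every chunk keeps the start time of its first sentence, the start_ms stored in the vector metadata always points at audio the chunk actually contains.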
Storage
- S3 · tx-audio: source recordings. Versioning enabled. Lifecycle to Glacier Instant Retrieval at 60 days; no auto-expiry by default (recordings are the record). SSE-KMS with a dedicated key.
- S3 · tx-transcripts: Transcribe JSON output, kept so the archive can be re-indexed without re-transcribing. Versioning enabled. SSE-KMS.
- S3 · tx-raw-mime: raw inbound MIME from forwarded recordings, for provenance. Lifecycle to Glacier at 30 days; expiry at 7 years.
- S3 · tx-rules-source: mirrored rules and access docs as plain text. Versioning enabled.
- S3 Vectors · tx-vectors: the searchable index. 1024-dim vectors from Titan V2, one per kept chunk. Metadata per vector: recording_id, start_ms, people, topic, access_tag, sensitive. Queried top-k with a metadata pre-filter.
- DynamoDB · tx-catalogue: one row per recording. PK recording_id; attributes: title, date, people, topic, access_tag, sensitive, transcript_key, audio_key, indexed. On-demand. GSI on date for browse.
- DynamoDB · tx-searchlog: one row per query. PK (user_id, ts); attributes: query, returned_ids, result_count, latency_ms. On-demand. No TTL: this is the long-term audit trail.
- DynamoDB · tx-access: one row per named open of a locked recording. PK (recording_id, ts); attributes: user_id, reason, granted_by. On-demand. No TTL.
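To make the tx-catalogue schema concrete, here is a minimal sketch of one row as a Python dataclass. The attribute names come from the list above; the types, the defaults, and the CatalogueRow and to_item names are assumptions for illustration.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class CatalogueRow:
    recording_id: str            # partition key
    title: str
    date: str                    # ISO date string (assumed); also the GSI key
    people: list = field(default_factory=list)
    topic: str = ""
    access_tag: str = ""
    sensitive: bool = False
    transcript_key: str = ""
    audio_key: str = ""
    indexed: bool = False        # set by the indexer once all chunks land

    def to_item(self):
        """Return a plain dict, e.g. for boto3's Table.put_item(Item=...)."""
        return asdict(self)
```

The filer would write a row like this with indexed left false, and the indexer would later flip that one attribute with an update.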
Amazon Transcribe
- Job config. StartTranscriptionJob with ShowSpeakerLabels=true and MaxSpeakerLabels tuned to room size; IdentifyLanguage=true unless a fixed language is set in config. Output to s3://tx-transcripts/. Custom vocabulary (product names, people, acronyms) raises accuracy on domain terms.
- Tiering. Connector-sourced jobs use the batch path (no latency pressure); forwarded recordings use standard. PII redaction can be enabled per access tag so transcripts of sensitive recordings store redacted text by default.
- Completion. Transcribe emits a job-state-change event to EventBridge; the filer Lambda triggers on COMPLETED, and on FAILED writes the recording to a dead-letter prefix and alerts.
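The job config above maps onto the StartTranscriptionJob request roughly as follows. This is a sketch, not the production handler: the build_transcribe_params helper, the tx- job-name prefix, and the default of four speakers are assumptions; the request fields themselves are the ones boto3's start_transcription_job accepts.

```python
def build_transcribe_params(recording_id, audio_key, max_speakers=4, language_code=None):
    """Build keyword arguments for transcribe.start_transcription_job."""
    params = {
        "TranscriptionJobName": f"tx-{recording_id}",  # hypothetical naming scheme
        "Media": {"MediaFileUri": f"s3://tx-audio/{audio_key}"},
        "OutputBucketName": "tx-transcripts",
        "OutputKey": f"{recording_id}.json",
        "Settings": {
            "ShowSpeakerLabels": True,
            # Speaker labelling needs at least two speakers.
            "MaxSpeakerLabels": max(2, max_speakers),
        },
    }
    if language_code:
        params["LanguageCode"] = language_code   # fixed language from config
    else:
        params["IdentifyLanguage"] = True        # automatic identification
    return params
```

In the Lambda this would be passed as boto3.client("transcribe").start_transcription_job(**params). LanguageCode and IdentifyLanguage are mutually exclusive, which is why the helper sets exactly one of them.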
Bedrock
- Embeddings. amazon.titan-embed-text-v2:0, 1024-dim, normalized. Two callsites: the indexer (one call per chunk at index time) and the search-handler (one call per query). The query and the chunks must use the same model and dimension.
- Foundation model. anthropic.claude-haiku-4-5-20251001-v1:0 via the Global cross-Region inference profile global.anthropic.claude-haiku-4-5-20251001-v1:0. One callsite: the search-handler, composing the grounded answer. Sonnet 4.6 is not used: the answer is a short, grounded summary of a few chunks, well within Haiku’s range, and the cost difference matters at search volume.
- Quotas. Default account quotas are more than enough at SMB volume. The expensive work is Transcribe, not Bedrock.
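Both embedding callsites can share one helper so the query and the chunks are guaranteed to use the same model and dimension. The sketch below assumes a bedrock-runtime client is passed in (which also makes it easy to fake in tests); the embed function name is hypothetical, while the request and response fields are those of Titan Text Embeddings V2.

```python
import json

TITAN_MODEL_ID = "amazon.titan-embed-text-v2:0"


def embed(text, client, dimensions=1024):
    """Return a normalized embedding for `text` using Titan Text Embeddings V2."""
    body = json.dumps({"inputText": text, "dimensions": dimensions, "normalize": True})
    response = client.invoke_model(
        modelId=TITAN_MODEL_ID,
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    vector = payload["embedding"]
    # Guard against a model or dimension mismatch before anything is indexed.
    if len(vector) != dimensions:
        raise ValueError(f"expected {dimensions} dims, got {len(vector)}")
    return vector
```

In a Lambda, client would be boto3.client("bedrock-runtime") created once at module level so it is reused across invocations.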
EventBridge and Scheduler config
- tx-drive-sync: rate(5 minutes). Target: drive-sync Lambda.
- tx-connector: rate(2 hours). Target: connector Lambda.
- tx-weekly-digest: cron(0 9 ? * MON *) in TZ. Target: digest Lambda.
- Transcribe completion rule: EventBridge rule on aws.transcribe Job State Change → target filer Lambda.
- tx.filed rule: custom-bus rule on the filer’s emitted event → target indexer Lambda.
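The Transcribe completion rule and the filer's branching on job status might look like the sketch below. The event pattern uses the source, detail-type, and detail fields that Transcribe publishes to EventBridge; the route_transcribe_event function and its "file" / "dead-letter" / "ignore" labels are hypothetical names for illustration.

```python
# Event pattern for the rule, expressed as the dict you would serialize to JSON.
TRANSCRIBE_PATTERN = {
    "source": ["aws.transcribe"],
    "detail-type": ["Transcribe Job State Change"],
    "detail": {"TranscriptionJobStatus": ["COMPLETED", "FAILED"]},
}


def route_transcribe_event(event):
    """Decide what the filer should do with a job-state-change event."""
    detail = event.get("detail", {})
    job_name = detail.get("TranscriptionJobName")
    status = detail.get("TranscriptionJobStatus")
    if status == "COMPLETED":
        return ("file", job_name)          # read transcript, write catalogue row
    if status == "FAILED":
        return ("dead-letter", job_name)   # move to dead-letter prefix and alert
    return ("ignore", job_name)
```

Filtering on status in the rule itself means the filer is never invoked for intermediate states.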
SES inbound and outbound
- Set the MX record on a dedicated subdomain (e.g. archive.your-company.com) to inbound-smtp.ap-southeast-1.amazonaws.com.
- SES inbound rule set tx-inbound-rules: one rule with recipient archive@your-company.com → spam scan → S3 PUT to s3://tx-raw-mime/<message-id> → stop. The S3 PUT triggers intake-ses-parser.
- SES outbound for the weekly digest: verify a sender identity at archive-bot@your-company.com with DKIM and SPF on the parent domain. Out of sandbox by request.
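Once the raw MIME lands in tx-raw-mime, locating the media attachment needs only the standard library. The sketch below is a simplification: find_media_attachment is a hypothetical name, it buffers the attachment in memory (the real parser streams large attachments), and it does not handle the download-link case.

```python
from email import message_from_bytes, policy


def find_media_attachment(raw_mime):
    """Return (filename, content_type, payload_bytes) for the first
    audio/video attachment in a raw MIME message, or None if there is none."""
    message = message_from_bytes(raw_mime, policy=policy.default)
    for part in message.iter_attachments():
        if part.get_content_maintype() in ("audio", "video"):
            return (
                part.get_filename(),
                part.get_content_type(),
                part.get_payload(decode=True),  # decodes base64 to bytes
            )
    return None
```

A None result is the cue to fall back to scanning the body for a download link.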
IAM (least privilege per Lambda)
Each Lambda has its own role with policies scoped to exact ARNs. Sketch:
- transcribe role: s3:GetObject on tx-audio; transcribe:StartTranscriptionJob; s3:PutObject on tx-transcripts; kms:Decrypt + kms:GenerateDataKey on the archive key. No bedrock:*.
- filer role: s3:GetObject on tx-transcripts and tx-rules-source; dynamodb:PutItem on tx-catalogue; events:PutEvents on the custom bus. No bedrock:*.
- indexer role: s3:GetObject on tx-transcripts; bedrock:InvokeModel on the Titan ARN; s3vectors:PutVectors on tx-vectors; dynamodb:UpdateItem on tx-catalogue (the indexed flag).
- search-handler role: bedrock:InvokeModel on the Titan ARN and the Haiku ARN; s3vectors:QueryVectors on tx-vectors; dynamodb:PutItem on tx-searchlog; dynamodb:GetItem on tx-catalogue. No write access to audio, transcripts, or vectors.
- access-handler role: dynamodb:PutItem on tx-access; dynamodb:GetItem on tx-catalogue; s3:GetObject + presign on tx-audio and tx-transcripts scoped per-request to the opened recording_id.
- drive-sync / connector / intake-ses-parser roles: secretsmanager:GetSecretValue on the relevant secret; s3:PutObject on tx-audio (and tx-rules-source for drive-sync); outbound network to the Google or meeting-tool API only.
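As one worked instance of the sketch above, the transcribe role's policy document could be generated like this. The transcribe_role_policy helper is hypothetical, the KMS key ARN is a parameter because the source does not give it, and the "*" resource on the Transcribe action is an assumption you may be able to narrow to a job ARN pattern.

```python
import json


def transcribe_role_policy(kms_key_arn):
    """Return an IAM policy document (as a dict) for the transcribe Lambda."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "s3:GetObject",
             "Resource": "arn:aws:s3:::tx-audio/*"},
            {"Effect": "Allow", "Action": "s3:PutObject",
             "Resource": "arn:aws:s3:::tx-transcripts/*"},
            # Assumption: left broad here; narrow if your job naming allows it.
            {"Effect": "Allow", "Action": "transcribe:StartTranscriptionJob",
             "Resource": "*"},
            {"Effect": "Allow", "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
             "Resource": kms_key_arn},
        ],
    }


def policy_actions(policy):
    """Flatten every action named in a policy document into one list."""
    actions = []
    for statement in policy["Statement"]:
        action = statement["Action"]
        actions.extend(action if isinstance(action, list) else [action])
    return actions
```

A check such as asserting that no action in policy_actions(...) starts with "bedrock:" is a cheap way to keep the "No bedrock:*" rule enforced in CI; json.dumps(policy) gives the document to attach.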
Search surface
The search box is a small static page that posts the query and the caller’s identity token to the search-handler Function URL. Identity comes from your existing SSO (the Function URL is fronted by IAM auth or a short-lived signed session); the archive doesn’t run its own user store. The response is rendered as a short answer, the quote in a blockquote, the recording title and date, and a play button whose link carries the start_ms so the audio element seeks straight to the moment. Locked recordings never appear here; opening one is a separate, authorized action through access-handler.
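Two small pieces of the search path are easy to show in isolation: dropping chunks the caller may not see, and building the play link from start_ms. This is a sketch under assumptions: the visible_chunks and deep_link names, the chunk dict shape, and the recording and t query parameters are all hypothetical, and the link here carries whole seconds.

```python
from urllib.parse import urlencode


def visible_chunks(chunks, caller_tags):
    """Keep only non-sensitive chunks whose access_tag the caller's teams cover."""
    return [
        chunk for chunk in chunks
        if not chunk.get("sensitive", False) and chunk["access_tag"] in caller_tags
    ]


def deep_link(base_url, recording_id, start_ms):
    """Build a player link that seeks to the chunk's start time (in seconds)."""
    query = urlencode({"recording": recording_id, "t": start_ms // 1000})
    return f"{base_url}?{query}"
```

Filtering happens before the model call, so text the caller cannot access never reaches the prompt.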
Observability and cost gates
- CloudWatch Logs: all Lambdas, 7-day retention, structured JSON. Metric filter on "error" + "throttle" + "timeout" feeding a CloudWatch metric for alerting.
- Alarms: Transcribe job failures > 0 in a day; indexer failures > 0 (an un-indexed recording is invisible to search); search-handler p95 latency > 4s; access-handler authorization failures > 5/hour.
- X-Ray: on for the search-handler only (the user-facing path); off elsewhere to save cost.
- AWS Budgets: $25/month threshold, alarm at 80% and 100%, posts to SNS topic tx-cost-alarm, subscribed to the on-call admin’s email.
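Structured JSON logging needs nothing beyond the standard library. The formatter below is a minimal sketch; the JsonFormatter name and the chosen fields are assumptions, and the lowercase level is deliberate so that a filter on the literal term "error" matches.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),  # "error", "warning", "info"
            "logger": record.name,
            "message": record.getMessage(),
        })


def configure_logging(name):
    """Attach the JSON formatter to a named logger and return it."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    return logger
```

A handler would call configure_logging("search-handler") once at module load and then log as usual.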
Config and secrets
Service-account credentials for the Drive API live in Secrets Manager under tx/drive/sa; the meeting-tool OAuth token under tx/meeting/oauth. The connector cursor, the chunk-size and overlap settings, the topic and people-alias tables, the access defaults, and the SES sender identity all live in Parameter Store under /tx/config/ (the larger tables as JSON in tx-rules-source, mirrored from Drive). The KMS key id for the archive and the index region are also config. Lambdas fetch config on cold start and cache for the lifetime of the execution environment.
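The fetch-on-cold-start pattern can be sketched as a module-level cache around Parameter Store's GetParametersByPath. The load_config name is hypothetical, and the SSM client is passed in rather than created inside so the function is easy to test.

```python
_CONFIG_CACHE = None  # lives as long as the Lambda execution environment


def load_config(ssm_client, prefix="/tx/config/"):
    """Load every parameter under `prefix` once, then serve from memory."""
    global _CONFIG_CACHE
    if _CONFIG_CACHE is not None:
        return _CONFIG_CACHE

    config = {}
    kwargs = {"Path": prefix, "Recursive": True, "WithDecryption": True}
    while True:
        page = ssm_client.get_parameters_by_path(**kwargs)
        for param in page["Parameters"]:
            # Strip the prefix so "/tx/config/chunk_size" becomes "chunk_size".
            config[param["Name"][len(prefix):]] = param["Value"]
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # results are paginated

    _CONFIG_CACHE = config
    return config
```

The trade-off is that a config change only takes effect when a new execution environment starts, which is usually acceptable for settings like chunk size that change rarely.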
Deploy
GitHub Actions with OIDC into a deploy role — no long-lived keys — running AWS SAM. The opinionated bits: deploy the SES rule set as a separate stack (rule-set changes affect mail flow), turn on S3 versioning for tx-audio, tx-transcripts, and tx-rules-source so a bad sync or edit can be rolled back, and keep the KMS key and the S3 Vectors index in the same region as the audio for data-residency reasons. Total deployable surface: around nine Lambdas, three DDB tables, one S3 Vectors index, four S3 buckets, a handful of EventBridge rules, one SES rule set, and one Budgets alarm.
That’s the full system. Six narrative posts and this engineering reference. If you want to talk about adapting it for your business, see Work with me.