Problem
Media organisations sit on enormous video archives — decades of broadcasts, interviews, press conferences. The metadata is shallow: headlines and dates, sometimes a transcript, rarely anything searchable by meaning. A journalist hunting "comments by central bankers about inflation expectations between 2022 and 2024" has nothing useful to query against.
Thomson Reuters wanted to change that for their internal teams: surface the right clips by intent, not keyword, and produce summaries that respect the editorial tone.
Process
I joined as the AWS engineer leading the technical delivery. The work split into three streams:
- Ingest pipeline. Take a video, extract audio, transcribe with timestamps, chunk by topical boundaries, embed each chunk for semantic search.
- Retrieval and summarisation. A query hits OpenSearch (vector + lexical), the top chunks plus their surrounding context fly to a Bedrock model, which produces a summary citing each clip with its in-video timecode.
- A reference UI. A React front-end demonstrating the search-and-summarise loop, built so customers could pick it up as an AWS solution and customise it.
The engineering challenge wasn't any single component — it was making the latency-cost-quality triangle work. Embeddings for an entire archive are expensive. Summarisation tokens add up. Customers want sub-second search.
I drove the architectural decisions on chunking strategy (semantic over fixed-size), the choice of Bedrock model per stage (smaller embedding model, larger summarisation model), and the caching layer that made repeated queries cheap.
Outcome
The solution was published as an official AWS reference architecture and showcased at IBC 2024, the broadcast industry's flagship event in Amsterdam. Customer-facing assets included the architecture diagrams, sample code, and the reference UI. Internal AWS teams have used it as the basis for further media-vertical engagements.
Technical Deep Dive (for engineers)
Pipeline architecture
Video uploaded → S3
│
▼
EventBridge → Step Functions
│
├─ Extract audio (Lambda + ffmpeg layer)
├─ Amazon Transcribe (with speaker diarisation)
├─ Topical chunking (Bedrock — Claude small model)
├─ Embeddings per chunk (Bedrock — Titan / Cohere)
└─ Index to OpenSearch (vector + keyword)
Search query
│
▼
API Gateway → Lambda
├─ Hybrid search (kNN + BM25 in OpenSearch)
├─ Re-rank top-k by relevance score
├─ Fan-out context retrieval (timecodes ± window)
└─ Bedrock summarisation (large model, RAG prompt)
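The "timecodes ± window" step in the search path can be sketched as follows. This is an illustration only, not the published sample code: the function name `expand_windows`, the tuple layout of a hit, and the 30-second default are all assumptions. Each hit is widened by a fixed window, and overlapping windows within the same clip are merged so the summarisation prompt does not receive the same transcript text twice.

```python
from collections import defaultdict


def expand_windows(hits, window_s=30.0):
    """Widen each hit by +/- window_s and merge overlaps per clip.

    hits: iterable of (clip_id, start_s, end_s) tuples (hypothetical shape).
    Returns {clip_id: [(start_s, end_s), ...]} sorted by start time.
    """
    by_clip = defaultdict(list)
    for clip_id, start_s, end_s in hits:
        # Clamp at zero so a hit near the start of a clip stays valid.
        by_clip[clip_id].append((max(0.0, start_s - window_s), end_s + window_s))

    merged = {}
    for clip_id, spans in by_clip.items():
        spans.sort()
        out = [spans[0]]
        for start_s, end_s in spans[1:]:
            last_start, last_end = out[-1]
            if start_s <= last_end:
                # Overlapping or touching windows collapse into one span.
                out[-1] = (last_start, max(last_end, end_s))
            else:
                out.append((start_s, end_s))
        merged[clip_id] = out
    return merged
```

Merging before fetching context keeps the prompt shorter when several top-ranked chunks come from the same stretch of one video.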
Why semantic chunking
Fixed-size chunks (every 30s, every 1k tokens) create artificial boundaries. A speaker midway through a thought gets cut. Topical chunking — using a smaller LLM to identify natural boundaries from the transcript — produced chunks where each was a self-contained idea. Empirically: ~25% lift in retrieval relevance over fixed-size at the same chunk count.
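A minimal sketch of the grouping step, assuming the smaller LLM has already returned the indices of transcript segments where a new topic starts. The model call itself is left out, and `Segment`, `chunks_from_boundaries`, and the dict layout of a chunk are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float  # segment start time in seconds
    end_s: float    # segment end time in seconds
    text: str


def chunks_from_boundaries(segments, boundaries):
    """Group timestamped segments into topical chunks.

    boundaries: indices of segments that open a new topic, assumed to come
    from the chunking model. Out-of-range indices are ignored.
    """
    if not segments:
        return []
    # Index 0 always opens the first chunk; a set removes duplicates.
    starts = sorted({0, *(b for b in boundaries if 0 < b < len(segments))})
    chunks = []
    for i, begin in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(segments)
        group = segments[begin:end]
        chunks.append({
            "start_s": group[0].start_s,
            "end_s": group[-1].end_s,
            "text": " ".join(s.text for s in group),
        })
    return chunks
```

Because each chunk keeps the start and end time of its segments, the timecodes needed for citations survive the chunking step.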
Hybrid search, not pure vector
Pure vector search misses exact-match queries. A journalist searching for a specific name wants lexical hits. We ran both vector kNN and BM25 against OpenSearch, then merged with a weighted RRF (reciprocal rank fusion) score. The weight was tunable per customer.
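The merge can be sketched as a small client-side function. This is an assumption about the shape of the fusion, not the project's code: `weighted_rrf` and `vector_weight` are hypothetical names, and `k=60` is the constant commonly used with reciprocal rank fusion rather than a value taken from the project. Each document scores `weight / (k + rank)` in every list it appears in, and the scores are summed.

```python
def weighted_rrf(vector_ids, lexical_ids, vector_weight=0.5, k=60):
    """Fuse two ranked lists of document ids with weighted RRF.

    vector_ids / lexical_ids: ids ordered best-first from kNN and BM25.
    vector_weight: share given to the vector list; the lexical list
    receives 1 - vector_weight. This is the per-customer tunable.
    """
    scores = {}
    weighted_lists = (
        (vector_weight, vector_ids),
        (1.0 - vector_weight, lexical_ids),
    )
    for weight, ranking in weighted_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties are broken by id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF uses ranks rather than raw scores, the kNN similarity and the BM25 score never need to be normalised onto a common scale.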
Citation discipline
The summarisation prompt required the model to cite each claim with [clip_id:HH:MM:SS] markers, parsed back into clickable timecodes in the UI. We refused to render summary text without at least one citation per sentence — anything ungrounded was filtered out. This kept the model honest.
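The filter can be sketched with a regular expression over the `[clip_id:HH:MM:SS]` markers. Several details here are assumptions for illustration: the allowed characters in a clip id, the naive sentence splitter, and the convention that a citation sits before the sentence's closing punctuation. The function name `grounded_sentences` is hypothetical.

```python
import re

# Matches markers such as [clip_17:00:12:30]; the id charset is an assumption.
CITATION = re.compile(r"\[([A-Za-z0-9_-]+):(\d{2}):(\d{2}):(\d{2})\]")


def grounded_sentences(summary):
    """Keep only sentences that carry at least one citation marker.

    Returns a list of (sentence, [(clip_id, offset_seconds), ...]).
    Sentences are split naively on whitespace after '.', '!' or '?'.
    """
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        citations = [
            (clip_id, int(h) * 3600 + int(m) * 60 + int(s))
            for clip_id, h, m, s in CITATION.findall(sentence)
        ]
        if citations:  # uncited sentences are dropped, not rendered
            kept.append((sentence, citations))
    return kept
```

The offset in seconds is what a UI would need to seek the player when a timecode is clicked.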
Latency budget
| Stage | Budget | Achieved |
|---|---|---|
| OpenSearch query | 200ms | ~120ms |
| Re-rank + context | 100ms | ~80ms |
| Bedrock summarisation | 2s | 1.4–2.1s |
| Total p95 | ~2.5s | ~2.3s |
Search-only (no summary) was sub-second. The summary streamed into the UI as it generated, so perceived latency was ~600ms before the first token appeared.
Cost shape
Embedding cost scales with archive size (one-time per video) and is amortised. Summarisation cost scales with queries and is the dominant ongoing line. We added a (query_hash, archive_version) cache so identical or near-identical queries within 24 hours returned from DynamoDB. Cache hit rate for the demo workload was ~30% within hours of a news event, much higher when teams ran similar searches on the same topic.
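A minimal sketch of the cache key and expiry logic, with an in-memory dict standing in for DynamoDB. The normalisation (lower-casing and collapsing whitespace) is an assumption about how near-identical queries could map to the same hash, and `cache_key` and `SummaryCache` are hypothetical names.

```python
import hashlib
import time

TTL_SECONDS = 24 * 60 * 60  # the 24-hour window described above


def cache_key(query, archive_version):
    """Build the (query_hash, archive_version) key from a normalised query."""
    normalised = " ".join(query.lower().split())
    query_hash = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    return (query_hash, archive_version)


class SummaryCache:
    """In-memory stand-in for the DynamoDB table (illustration only)."""

    def __init__(self, ttl_s=TTL_SECONDS, clock=time.time):
        self._items = {}
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested

    def put(self, query, archive_version, summary):
        key = cache_key(query, archive_version)
        self._items[key] = (self._clock() + self._ttl_s, summary)

    def get(self, query, archive_version):
        key = cache_key(query, archive_version)
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, summary = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired entries count as misses
            return None
        return summary
```

Including the archive version in the key means newly ingested video invalidates older summaries without an explicit purge.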
Trade-offs
- No fine-tuning. We deliberately avoided fine-tuning the summarisation model. RAG with strong citations gave better factual grounding and made the solution portable across customers without per-customer training.
- Vector store choice. OpenSearch was chosen over Pinecone or pgvector because it ran inside the customer's AWS account with the existing data residency story. Slower to optimise, but the right blast radius for a regulated customer.
