AI Video Understanding: Transcription, Summarization, and Analysis

Video is the format most human knowledge now lives in — meetings, training, presentations, interviews, educational content, product demos. Most of this is effectively unsearchable: you can’t Ctrl+F a recording, and watching hours of video to find one relevant segment is painfully slow.

AI video understanding tools are changing this. Here’s what works now.

The core capabilities

Transcription

Automated speech recognition (ASR) has become accurate enough for most production use cases. Whisper (OpenAI) set a new quality bar in 2022 and remains the most widely deployed model, but several alternatives now compete at similar quality:

OpenAI Whisper: Open-source, very high accuracy, supports 99 languages. Available as an API or self-hosted.
Deepgram: Commercial API with strong real-time and batch transcription. Nova-3 model is competitive with Whisper on accuracy, faster for real-time.
AssemblyAI: Adds speaker diarization, sentiment, topic detection, auto-chapters on top of ASR.
Rev.ai: Higher accuracy for difficult audio (heavy accents, poor recording conditions) via hybrid human+AI.
Google Cloud Speech-to-Text: Competitive, integrates well in GCP ecosystems.

For most use cases, Whisper (via API or self-hosted) is the right starting point. Self-hosting with faster-whisper on a single GPU processes roughly 5-10x real-time (1 hour of audio in 6-12 minutes) at near-zero marginal cost.

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("meeting_recording.mp4", beam_size=5)

transcript_with_timestamps = []
for segment in segments:
    transcript_with_timestamps.append({
        "start": segment.start,
        "end": segment.end,
        "text": segment.text.strip()
    })

Speaker diarization

Knowing who said what is often as important as what was said. Speaker diarization assigns speaker labels to transcript segments.

pyannote.audio is the open-source standard:

from pyannote.audio import Pipeline
import torch

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your-hf-token"
)
pipeline.to(torch.device("cuda"))

diarization = pipeline("meeting.mp4")

# Combine with transcript using timestamp alignment
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:.2f}s - {turn.end:.2f}s] {speaker}")

Combined with Whisper transcription and timestamp alignment, this produces labeled transcripts:

[0:00 - 0:12] SPEAKER_1: Let's get started with the Q3 review.
[0:13 - 0:45] SPEAKER_2: Sure, I'll start with the revenue numbers...

Video summarization

Once you have a transcript, LLM summarization is straightforward. The non-trivial parts are:

Handling long videos: A 2-hour meeting transcript might be 30,000+ words — too long for a single LLM call. Strategies:

Chunked summarization: Summarize each 10-minute segment, then summarize the summaries
Hierarchical summarization: Multiple levels of summary for different detail needs
Map-reduce: Parallel summarization of chunks, then synthesis

Structured summaries: Generic “summarize this” produces generic output. Better prompts produce structured, usable output:

prompt = """You are analyzing a transcript of a business meeting.

Produce a structured summary with:
1. MEETING OVERVIEW (2-3 sentences: purpose, attendees, outcome)
2. KEY DECISIONS (bulleted list — only actual decisions made, not discussions)
3. ACTION ITEMS (format: "- [Name] will [action] by [deadline if mentioned]")
4. OPEN QUESTIONS (unresolved questions that require follow-up)
5. KEY TOPICS DISCUSSED (brief bullet for each major topic, with approximate timestamp)

Transcript:
{transcript}
"""

Topic segmentation: Identify when the meeting shifts topics and label each segment. Useful for creating navigable meeting recordings.

Visual content analysis

Pure transcription misses what’s visible in the video — slides, demos, whiteboards, product demonstrations, non-verbal cues.

For slide and document content in recordings:

Extract frames at regular intervals or on scene changes
OCR or vision model analysis per frame
Merge visual content with timestamped transcript

import cv2
from openai import OpenAI
import base64

client = OpenAI()

def analyze_frame(frame_data: bytes) -> str:
    base64_image = base64.b64encode(frame_data).decode('utf-8')
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe what's on screen. If there are slides or text, transcribe the key content."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                ]
            }
        ],
        max_tokens=300
    )
    return response.choices[0].message.content

# Sample frames at 1 per second
cap = cv2.VideoCapture("presentation.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
frame_analyses = []

frame_number = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    
    if frame_number % int(fps) == 0:  # Once per second
        timestamp = frame_number / fps
        _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
        analysis = analyze_frame(buffer.tobytes())
        frame_analyses.append({"timestamp": timestamp, "content": analysis})
    
    frame_number += 1

cap.release()

Native multimodal video analysis with Gemini

For shorter videos (under 1 hour), Google’s Gemini API can process video natively:

import google.generativeai as genai

genai.configure(api_key="your-api-key")
model = genai.GenerativeModel("gemini-2.0-flash")

# Upload video file
video_file = genai.upload_file("demo_video.mp4")

# Wait for processing
import time
while video_file.state.name == "PROCESSING":
    time.sleep(2)
    video_file = genai.get_file(video_file.name)

# Query the video
response = model.generate_content([
    video_file,
    "Summarize the key product features shown in this demo video. Include specific timestamps for each feature."
])

print(response.text)

Gemini processes video natively — it sees the frames and hears the audio without your having to extract them separately. Quality is strong for general-purpose video understanding tasks.

Production pipeline architecture

For a production video intelligence system:

Video Input
    ↓
[Stage 1: Audio extraction + Whisper transcription]
    ↓
[Stage 2: Speaker diarization + timestamp alignment]
    ↓
[Stage 3: Frame extraction + scene detection]
    ↓
[Stage 4: Frame analysis (slides, whiteboards, screen content)]
    ↓
[Stage 5: Multimodal merge — transcript + visual content + timestamps]
    ↓
[Stage 6: LLM synthesis — summaries, action items, topics]
    ↓
[Output: Searchable transcript, structured summary, topic index]

Key considerations:

Parallelize stages 2, 3, and 4 — they’re independent
Cache intermediate results (transcripts, frame analyses) to avoid reprocessing
Build a search index over the timestamped transcript chunks (embeddings + BM25)
Store structured outputs (action items, decisions) in a database, not just text

Real applications being built now

Meeting intelligence: Automated notes, action items, and summaries from Zoom/Teams recordings. Companies like Otter, Fireflies, and Notion AI are commercial implementations. Building a custom version is straightforward with Whisper + LLM.

Training and onboarding content indexing: Make hours of instructional video searchable by topic and question. “What did the training say about the refund policy?” becomes answerable.

Legal and compliance: Deposition transcription with speaker attribution, automated compliance review of recorded customer service calls.

Education: Automatic chapter creation, quiz generation from lecture content, topic-based navigation of course recordings.

Media production: B-roll discovery in large footage archives, automated logging, content identification for licensing.

Customer interviews: Insight extraction across user research recordings at scale.

Cost and latency benchmarks

For a rough 1-hour video:

Whisper transcription (self-hosted, GPU): ~6-10 minutes, ~$0.01-0.05 marginal cost
Whisper API: ~$0.36 (at $0.006/minute)
Speaker diarization: ~3-5 minutes self-hosted, $0.05-0.15 via API
LLM summarization (1hr transcript): ~$0.05-0.20 depending on model and length
Frame analysis (sampled): $0.10-0.50 depending on sampling rate and model

Total cost for full analysis of a 1-hour meeting: roughly $0.20-1.00. At scale, self-hosting ASR and diarization is significantly cheaper.

The tools and the economics are both mature enough for production. If your organization has significant video content that’s currently unsearchable, this is a high-ROI application area.