🟣 Technical 9 min read

Audio-Visual Multimodal Models: How They Work and What They Can Do

The next frontier for MLLMs isn't just text + images — it's audio and video. Here's how audio-visual models work and what capabilities they enable.

View all mllms depths →

The progression of multimodal large language models has followed a clear pattern: text-only → text + images → text + images + audio → native video understanding. Each step expands what these models can perceive and reason about — and opens up genuinely new application categories.

Audio-visual MLLMs are the current frontier. Understanding how they work and what they can do is increasingly essential for AI practitioners.

The architecture challenge

Extending language models to handle audio and video isn’t just a matter of adding another input modality. Each modality has different:

  • Dimensionality: Audio is 1D (waveforms), images are 2D (pixels), video is 3D (frames over time), text is discrete tokens
  • Temporal structure: Audio and video are time-indexed; images and text have different notions of sequence
  • Information density: A 10-minute video contains vastly more raw data than any text transcript
  • Semantic content: The same event carries different information in its visual, auditory, and linguistic channels

The fundamental technical challenge: how do you get a language model — whose native vocabulary is text tokens — to process and reason across all these modalities?

How audio is encoded for language models

Audio cannot be directly processed by a language model. It must be converted into a representation the model can handle. Two main approaches:

Discrete audio tokens

Convert audio into a sequence of discrete tokens, similar to text. Audio codec models (like EnCodec, DAC) learn to compress audio into sequences of integer codes that can be reconstructed. These codes are then treated like tokens and processed by the language model.

Advantages:

  • Audio is in the same representation space as text — the model can generate audio by predicting token sequences
  • True end-to-end audio processing and generation (Moshi, the real-time voice model, uses this approach)

Disadvantages:

  • Vocabulary explosion (audio requires many more distinct tokens than text)
  • High sequence length (1 second of audio → many tokens)
  • Lossy compression introduces artifacts

Continuous audio embeddings

Pass audio through a dedicated audio encoder (like Whisper, wav2vec, or a custom model), extract continuous embeddings, project them into the language model’s representation space via a linear or MLP adapter.

This is the approach used in most current multimodal models (Gemini, GPT-4o). The audio encoder handles the perceptual processing; the language model handles reasoning.

Advantages:

  • Leverages strong pre-trained audio encoders
  • More efficient (audio compressed to fewer embedding vectors than discrete tokens)
  • Separates audio perception from reasoning

Disadvantages:

  • Model cannot directly generate audio (only process it); separate audio decoder needed for output
  • Adapter introduces an additional component to train and align

How video is handled

Video is the most challenging modality for MLLMs because of its sheer data volume. A 60-minute video at 24fps and standard resolution is billions of pixels of data — feeding this raw is completely infeasible.

Current approaches to video understanding:

Frame sampling

Sample N frames from the video at uniform or adaptive intervals, process each frame as an image (standard vision encoder), concatenate the frame embeddings, and provide them to the language model along with any text/audio.

This is simple and works well for:

  • Answering questions about what’s in a video
  • Summarizing video content
  • Identifying scenes and objects

Limitations: Misses temporal dynamics. Fast motion, brief events, and timing-specific queries are poorly handled by uniform sampling.

Temporal encoders

Specialized video encoders (Video-LLaMA, ViViT, TimeSformer) designed to model temporal relationships between frames. Temporal attention mechanisms consider how features change across time, not just what appears in each frame.

This improves performance on:

  • Action recognition (“What is the person doing?”)
  • Event timing (“When does X happen?”)
  • Causality reasoning (“Why did X happen?”)

Efficient compression for long video

For long-form video understanding (lectures, films, extended meetings), the frame volume is too large for any naive approach. Current techniques:

  • Hierarchical processing: Summarize short clips, then summarize summaries
  • Event detection + sparse sampling: Detect scene changes or activity spikes, sample more densely around them
  • Memory-augmented processing: Use external memory to accumulate long-range context

Full-length video understanding at high quality is still an active research problem.

Omni-modal models: handling all modalities natively

The most ambitious direction is models that natively handle arbitrary combinations of text, images, audio, and video as input and output.

GPT-4o (the “omni” in the name) was an early step: real-time audio input and output with visual understanding. The architecture keeps audio in a native representation rather than converting to text first — enabling the lower latency and more natural voice conversation that makes it feel different from previous voice interfaces.

Gemini 1.5/2.0 was designed from the ground up to be natively multimodal. Its architecture processes interleaved sequences of text, image, audio, and video tokens. It holds the record for longest context window (1M+ tokens), which when applied to video means it can reason over very long video content.

Moshi (from Mistral/Kyutai) is the most sophisticated real-time audio-language model: it processes audio input and generates audio output natively, enabling full-duplex conversation (talking and listening simultaneously) with interruption handling — something previous voice assistants couldn’t do.

Key capabilities enabled by audio-visual MLLMs

Real-time voice assistants with visual context

“Look at this circuit board — why isn’t it working?” The model sees what you’re showing it, hears your question, and responds by voice. No transcription step, no mode switching.

This is the direction voice assistants are evolving. The camera + mic as a continuous context channel rather than discrete query-response.

Automated video analysis

  • Meeting recordings → structured summaries with speaker attribution, action items by speaker, timeline of topics discussed
  • Training video → automated quiz generation, knowledge gap identification
  • Surveillance or monitoring video → event detection and structured logging
  • Medical imaging video (endoscopy, ultrasound) → real-time analysis and flagging

Find moments in video by describing them in text or audio. “Find the part where she mentions Q3 projections.” “Show me all frames where someone is writing on a whiteboard.”

This requires models that can bridge semantic understanding across modalities — understanding that a text query and a visual or audio event represent the same concept.

Audio-conditioned image and video generation

“Generate a music video that matches the mood of this track” — the model analyzes the audio’s tempo, key, mood, and instrumentation, then guides visual generation accordingly. Or more directly: generate animation that is synchronized to beat and dynamics.

Real-time scene description for accessibility

A live feed of visual and audio information → narration for visually impaired users. The model needs to track what’s new, what’s changed, what’s relevant — continuous understanding rather than single-shot captioning.

Current limitations

Temporal resolution: Current models sample video sparsely. Events that happen in less than a second are often missed or described incorrectly.

Audio-visual synchronization understanding: Catching that the footsteps on screen don’t match the sound (indicating editing), or that lip sync is off, requires precise temporal alignment models don’t always maintain.

Long-form coherence: Understanding a 2-hour film requires reasoning about plot arcs, character development, and callbacks across many thousands of frames. Current models handle this inconsistently.

Real-time processing: Processing video in real time at high quality still requires significant compute — not yet practical on consumer hardware without quantization and optimization.

Native audio generation quality: When models generate audio, quality (especially for music or rich sound design) lags behind specialized audio generation models.

Where this is heading

The trajectory is toward models that are genuinely perceptually present — able to see, hear, and understand the world more fully over extended time periods. The architectural innovations enabling this (efficient long-context processing, better temporal modeling, native audio representation) are all advancing rapidly.

The near-term impact: better real-time voice interfaces, automated analysis of video and audio content at scale, and accessibility tools that provide genuinely useful real-world scene understanding. The longer-term: AI systems that can participate in video calls, watch presentations, review recorded content, and reason across extended audiovisual contexts with the same fluency they currently bring to text.

Simplify

← Vision-Language Models: How MLLMs Understand Images and Text Together

Go deeper

Benchmarking Multimodal LLMs: What to Measure and How →

Related reads

mllmsmultimodalaudio-visualvideo-understandingomni-modelsgemini

Stay ahead of the AI curve

Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.

No spam. Unsubscribe anytime. We respect your inbox.