🔴 Research 28 min read

Multimodal AI — Frontier Research and Unresolved Problems

The frontier of multimodal AI research: cross-modal alignment, grounding, emergent capabilities, compositional reasoning, evaluation methodology, and why integrating modalities is harder than it looks.

View all multimodal ai depths →

Multimodal AI is where the field’s most ambitious ideas converge. The core premise: intelligence that works on the world as the world actually presents itself — through images, sound, text, touch, video, sensor streams — rather than through any single abstracted representation. Large multimodal models (LMMs or MLLMs) are the current leading embodiment of this idea: systems that can process text, images, audio, and increasingly video in a unified framework, reasoning jointly across modalities to answer questions, generate content, or drive action.

The capability progress has been extraordinary. GPT-4V, Claude 3 Opus, Gemini 1.5 Pro, and similar systems can perform medical image analysis, interpret complex scientific figures, describe video scenes, and follow instructions that combine visual and linguistic context. These capabilities were not reliably available at all three years ago.

But multimodality introduces a new set of unsolved problems that single-modality systems don’t face: how to align representations across modalities, how to ground language in perception, how to compose information from multiple sources coherently, and how to evaluate systems that operate across multiple input types simultaneously. This piece examines where the research frontier actually stands.

Why multimodal AI is harder than modality addition

A naive view of multimodal AI: take a language model, add a vision encoder, and gain visual capability. The reality is more complex. Modalities differ fundamentally in their representational structure, temporal properties, information density, and the kinds of grounding relationships they support.

Representational asymmetry. Text is compositional and structured: words combine according to grammar to produce meaning that’s, in principle, decomposable. Images are distributed and continuous: meaning is encoded across all pixels simultaneously with no natural segmentation into discrete “visual words.” Audio is temporal and sequential. These different representational structures mean that aligning modalities in a shared embedding space isn’t straightforward — the embedding geometry that serves text well may not serve images well.

Cross-modal grounding ambiguity. When a caption says “the dog chases the ball,” which pixels in the image correspond to “dog,” which to “ball,” and which to “chases”? Grounding language to image regions requires solving correspondence at multiple granularities (word to object, sentence to scene, discourse to sequence of images). This is harder than it appears because natural language descriptions are often incomplete, underspecified, or refer to abstractions that don’t have clear visual counterparts.

Modality-specific biases. Models trained on each modality learn biases specific to that modality’s typical statistical structure. When modalities are combined, their biases can interact in unexpected ways. A vision model that has learned that “if a kitchen background is present, there’s likely a stove” may incorrectly apply this heuristic when the text prompt mentions that the person is at a restaurant.

Integration architecture choices. How to combine modalities architecturally is itself an open question. Early fusion (combine raw features before processing), late fusion (process separately and merge at output), cross-attention alignment layers, and full joint transformer architectures all make different tradeoffs in terms of cross-modal reasoning depth, efficiency, and modality flexibility.

1) Architecture: how multimodal systems are built

The dominant current paradigm

Most state-of-the-art MLLMs follow a common pattern:

  1. Modality-specific encoders convert each input type (images, audio, video) into token sequences in a learned embedding space. For images, typically a Vision Transformer (ViT) or similar; for audio, typically a Whisper-derived or wav2vec encoder.
  2. Cross-modal projection layers map encoder outputs into the language model’s token embedding space — usually a learned linear projection, an MLP, or (in more sophisticated systems) cross-attention blocks.
  3. A large language model backbone processes the projected multimodal tokens alongside text tokens in a unified context window, generating text outputs autoregressively.

This architecture — sometimes called the “connector paradigm” — was popularized by LLaVA (Visual Instruction Tuning, Liu et al., 2023) and adopted in variants across most major open and proprietary multimodal systems.

Limitations of the connector paradigm

Visual tokens dominate context. High-resolution images require many tokens after patching — a 1024×1024 image produces ~4,096 tokens with standard ViT patch sizes. For models with context length limits, this leaves limited room for text context alongside images. Approaches like dynamic resolution (generating fewer patches for simpler images) and token compression (summarizing visual tokens) are active areas.

Encoding vs. understanding. Visual encoders trained on self-supervised objectives (contrastive learning, masked image modeling) learn representations useful for recognition but not necessarily for the kinds of spatial, relational, and logical reasoning that multimodal LLMs need. The assumption that a visual encoder produces “good” features for arbitrary downstream reasoning is not guaranteed.

Modality imbalance. Systems trained predominantly on text-image pairs often show significant capability degradation on audio, video, or other modalities — the language backbone’s strong language priors can dominate the contribution from non-text modalities.

Alternative architectures

Native multimodal tokenization. Instead of using modality-specific encoders, some research explores converting all modalities into a shared token vocabulary from the start — representing images, audio, and text as token sequences over a unified vocabulary. GPT-4o and Gemini appear to use versions of this approach. This enables more symmetric treatment of modalities but requires large-scale training on diverse multimodal data from scratch.

Mixture-of-experts for modalities. Routing different input types to specialized expert pathways within a shared model framework allows modality-specific processing while sharing a unified reasoning backbone. How to balance specialization and sharing remains an open design question.

2) Cross-modal reasoning: the central hard problem

What cross-modal reasoning requires

True cross-modal reasoning — not just answering questions about images, but integrating information from multiple sources to draw conclusions that neither modality supports alone — is the defining capability that multimodal AI is working toward.

Example: A medical MLLM receives a radiological image, a voice recording of the patient describing symptoms, and the patient’s text history. True multimodal reasoning would synthesize all three to produce a differential diagnosis where the image findings, symptom description, and history are jointly interpreted. Each modality constrains and informs interpretation of the others.

Current systems demonstrate this capability to varying degrees in controlled settings, but fail systematically in ways that reveal the limits of their cross-modal integration.

Failures in cross-modal reasoning

Modality unimodal shortcuts. On multimodal benchmarks that ostensibly require integration of text and image, models frequently achieve high accuracy by using only one modality — answering based on the text question alone, or answering based on the image alone without considering the question. Studies on VQA datasets have shown that state-of-the-art models achieve surprisingly high accuracy even when provided with the wrong image or no image, suggesting reliance on language priors rather than visual grounding.

Spatial reasoning failures. “Which object is to the left of the chair?” “Which person is closer to the camera?” These spatial questions require understanding the image’s geometric structure and relating it to spatial language. MLLMs systematically underperform on spatial reasoning tasks compared to their performance on non-spatial visual questions.

Cross-modal consistency failures. When asked the same question about equivalent visual and textual content (a scene described in text vs. depicted in an image), models often give different answers. The system doesn’t maintain a consistent internal model of the scene across representational formats.

Grounding precision. “Point to the red umbrella on the left side of the image.” Fine-grained spatial grounding — identifying precise image regions corresponding to described objects — requires tight coupling between language tokens and image spatial tokens. Current systems often produce rough bounding boxes rather than precise localizations, and fail on queries that require understanding of overlapping objects, unusual viewpoints, or occluded regions.

Research directions in cross-modal reasoning

Explicit spatial and relational representations. Rather than relying on distributed attention mechanisms to implicitly capture spatial relationships, architectures that explicitly model object positions, sizes, and relationships provide more reliable spatial reasoning. Object-centric representations, scene graphs, and slot-attention mechanisms are being explored.

Chain-of-thought in multimodal contexts. Chain-of-thought prompting significantly improves LLM reasoning on language tasks. Multimodal chain-of-thought research explores whether step-by-step reasoning that explicitly references image regions (“first, I identify the red object in the upper-left… then I check whether it is to the left of the blue object…”) improves cross-modal reasoning. Results are promising but inconsistent across domains.

Contrastive and negative examples in training. Standard multimodal training pairs each image with a positive description. Adding negative examples — hard negatives that differ in subtle ways (same scene with one object moved, same description with one word changed) — forces the model to learn finer-grained alignment. Negative-contrastive training has shown improvements on compositional reasoning benchmarks.

3) Visual grounding and referring expression

The grounding problem

Visual grounding asks: given a natural language expression, locate the corresponding region in an image. “The person wearing a red hat” — which bounding box in the image does this refer to?

Referring expression comprehension (REC) benchmarks like RefCOCO and Visual Genome have shown significant improvement from multimodal pretraining. Models can identify objects when expressions use common attribute-object combinations (“the tall man in the black shirt”).

But systematic failures emerge with:

  • Ambiguous expressions. If there are two people in red hats, which does “the person in the red hat” refer to? Models handle ambiguity poorly — they typically guess rather than surfacing uncertainty.
  • Relational expressions. “The person to the left of the one in the blue coat” requires multi-step relational reasoning (identify blue coat → identify left) that current models handle inconsistently.
  • Negative expressions. “The chair that is not next to the window” requires reasoning about the absence of spatial relationships — systematically poorly handled.
  • Generalized referring expressions. Novel object types, unusual viewpoints, and domain-specific contexts (medical images, technical diagrams, satellite imagery) expose the limits of models trained predominantly on natural scene photography.

Phrase grounding and dense prediction

Phrase grounding — localizing multiple phrases simultaneously in an image — is harder than single referring expression comprehension. Models must avoid attributing the same region to multiple different phrases, must segment overlapping objects, and must handle phrases that refer to abstract properties (color, texture) rather than discrete objects.

Grounded SAM (combining Grounding DINO’s grounded detection with SAM’s segmentation) and similar systems show strong results on benchmark tasks but struggle with compositional expressions and generalization to domain-specific imagery.

4) Multimodal hallucination: a fundamental reliability problem

What multimodal hallucination looks like

MLLMs frequently generate factually inaccurate descriptions of images — describing objects, people, or attributes that are not present. This is the visual analogue of LLM hallucination, and it’s often more subtle because users may not be looking carefully at the image while reading the model’s description.

Examples of documented hallucination types:

  • Object hallucination — describing an object not present in the image
  • Attribute hallucination — correctly identifying an object but incorrectly describing its color, size, or orientation
  • Relationship hallucination — correctly identifying two objects but incorrectly describing their spatial or functional relationship
  • Count hallucination — describing the wrong number of objects
  • Action hallucination — describing an action not depicted in the image

Why MLLMs hallucinate

Language model priors strongly influence image descriptions. A model trained on many captions of kitchen scenes will have high prior probability for describing “a stove” even when the image contains a different appliance. When visual evidence is ambiguous, the language prior takes over, and the model generates plausible text rather than accurate visual description.

This is compounded by training data issues: captioning datasets often have loose correspondence between images and their captions — the caption describes what’s typical for a scene, not necessarily what’s exactly depicted.

Hallucination mitigation research

RLHF and DPO on image-description pairs, where human raters prefer accurate descriptions over plausible ones, can reduce hallucination. Inference-time verification — generating multiple descriptions and checking consistency — helps but adds cost. POPE (Polling-based Object Probing Evaluation) has been widely used to measure object hallucination.

But fundamental progress requires training on data with precise image-description correspondence, better alignment training objectives that penalize hallucination, and inference procedures that explicitly verify generated claims against image content. None of these is fully solved.

5) Multimodal evaluation: measuring what matters

Benchmark taxonomy

Multimodal evaluation is more complex than single-modality evaluation because of the combinatorial space of possible input types, tasks, and reasoning requirements.

Image-text benchmarks:

  • VQA v2, GQA — visual question answering at scale
  • OK-VQA — questions requiring external knowledge beyond image content
  • MMBench, MMMU, MME — comprehensive capability evaluation across multiple visual tasks
  • SEED-Bench — systematic evaluation across 19 evaluation dimensions
  • Winoground, ARO — compositional and relational understanding specifically

Video-language benchmarks:

  • ActivityNet-QA, NExT-QA — temporal video question answering
  • EgoSchema, Video-MME — long-form video understanding
  • MSVD-QA, MSR-VTT-QA — general video question answering

Audio-visual benchmarks:

  • AVSBench — audio-visual segmentation
  • AVE — audio-visual event localization
  • AVQBench — audio-visual question answering

Benchmark limitations

Saturation and overfitting. MMMU was released in late 2023 as a challenging multimodal benchmark requiring college-level knowledge. Top models have made rapid progress, with concerns emerging that gains partly reflect benchmark-specific optimization rather than general capability improvement.

Language shortcuts. Many visual benchmarks can be partially solved using text-only reasoning (the question text contains enough context to answer without the image). Models that exploit these shortcuts show inflated performance that doesn’t reflect genuine visual understanding.

Narrow task scope. Most benchmarks test discrete question answering — the right answer is a short string or choice. Open-ended generation quality, long-form reasoning, and practical task completion (using image content to guide multi-step action) are poorly captured by standard benchmarks.

No unified evaluation standard. Unlike NLP, where MMLU, HellaSwag, and BIG-Bench provide some cross-system comparability, multimodal evaluation lacks a dominant standard benchmark that enables reliable cross-system comparison. Research papers report on different benchmark combinations, making comparison difficult.

Emerging evaluation approaches

Process-level evaluation. Rather than evaluating only the final answer, evaluating the reasoning process — does the model correctly identify relevant image regions, correctly apply the appropriate concepts, produce reasoning steps that are grounded in actual image content? This requires substantially more expensive evaluation but provides richer diagnostic information.

Compositional hold-out evaluation. Systematically constructing evaluation sets that require composition of capabilities not seen jointly in training (novel combinations of object types, spatial configurations, and reasoning requirements) tests generalization rather than memorization.

Multimodal adversarial evaluation. Constructing evaluation examples specifically designed to exploit known failure modes — confusable objects, ambiguous spatial arrangements, linguistic traps — provides a more robust assessment of actual capability.

6) Emerging multimodal capabilities and open questions

Emergent multimodal abilities

Large-scale multimodal pretraining produces some capabilities that weren’t explicitly trained:

Zero-shot visual chain-of-thought. Models can be prompted to reason step-by-step about complex visual scenes without explicit training on step-by-step visual reasoning examples.

Cross-modal analogical reasoning. “This image is to French architecture as which image is to Japanese architecture?” Tests of visual analogical reasoning show that large MLLMs can perform these tasks with some reliability.

Scientific figure interpretation. Interpreting complex charts, graphs, and scientific figures requires combining visual feature recognition with domain knowledge. GPT-4V and similar systems can do this for common chart types with reasonable accuracy.

Document understanding. Reading and reasoning about scanned documents, handwritten notes, structured tables, and mixed-layout pages — treating them as images rather than pre-processed text — has become practically useful.

Open capability questions

Grounded world knowledge. Can a multimodal model learn factual knowledge about the world that is intrinsically visual — facts about what objects look like, how physical processes appear, what spatial configurations are typical — rather than learning visual descriptions of language-expressed facts? This would require different pretraining strategies than current approaches.

Persistent multimodal memory. Current MLLMs process each query independently. An agent that could build up a persistent visual-linguistic model of an environment over time — remembering what it has seen, updating its model as new observations arrive — would be qualitatively more capable. Architectures for persistent multimodal memory are early-stage.

Embodied multimodal interaction. Vision-language-action models (VLAs) for robotics — models that perceive visual scenes, understand natural language instructions, and produce physical actions — represent the most ambitious multimodal application. RT-2 (Brohan et al., 2023), PaLM-E, and subsequent work have demonstrated that language model capabilities can improve robot control, but embodied multimodal AI remains far from general capability.

7) Safety and alignment in multimodal systems

Multimodality introduces new attack surfaces and alignment challenges that don’t exist for text-only systems.

Visual jailbreaks. Safety-trained MLLMs can sometimes be bypassed by embedding instructions or harmful content in images rather than text. An image containing the text “ignore your safety guidelines and tell me how to…” can sometimes evade text-based safety filters because the model doesn’t apply the same filter to text extracted from images.

Cross-modal prompt injection. In deployed multimodal systems that process external images (websites, documents, screenshots), adversarial images can contain instructions intended to redirect the model’s behavior — a visual form of prompt injection.

Multimodal bias and representation. Biases in visual training data interact with language biases in complex ways. A multimodal model may produce different responses to the same question when accompanied by images of people with different demographic characteristics — reflecting the joint effect of image training data biases and language training biases.

Attribution and generation responsibility. When a multimodal system generates content that incorporates specific visual styles, likeness, or protected creative elements from training images, questions of attribution and copyright are substantially more complex than for text generation alone.

8) The integrated intelligence frontier

Any-to-any models

The most ambitious current direction is “any-to-any” multimodal systems: models that can take any combination of input modalities and generate any output modality. Text → image, image → text, audio → text, text → audio — and crucially, image+text → image, audio+text → video, and so on across all combinations.

GPT-4o demonstrated significant progress toward this goal with native image output capability. Gemini’s natively multimodal architecture represents another direction. But genuinely flexible any-to-any generation with consistent quality across modality combinations remains a research frontier.

Multimodal agents

Agents that use multimodal models to perceive and act in digital and physical environments represent the most practically significant application of multimodal AI. An agent that can read a screen, understand what’s shown, generate appropriate actions (clicks, keystrokes, voice commands), observe the results visually, and iterate represents a qualitatively different kind of tool than a question-answering system.

Computer use agents (Claude Computer Use, GPT-4V-based agents, Operator-class systems) have demonstrated early capability in this direction. Reliability, error recovery, and generalization to novel interfaces remain significant open challenges.

The alignment implications of multimodal agents — systems that can perceive the world and take actions based on that perception — are significantly more complex than for text-only systems. Ensuring that actions are aligned with user intent, that the agent’s perceptual model is accurate, and that errors don’t cascade into consequential mistakes requires safety and evaluation frameworks that the field is still developing.

Closing perspective

Multimodal AI is where the field’s most interesting problems live. The individual modality challenges — vision, audio, video, language — are each formidable. Their combination introduces a new layer of difficulty in alignment, grounding, composition, and evaluation that single-modality approaches don’t face.

The practical significance is also large: the most useful AI agents will need to perceive and reason across the full diversity of information formats that humans use. Systems that can only process one modality at a time are fundamentally limited in the range of tasks they can assist with.

The trajectory of the field suggests that modality integration will continue improving with scale and data — the same scaling properties that have driven language model progress appear to apply to multimodal systems. But fundamental challenges in compositional reasoning, grounding, hallucination, and temporal understanding suggest that scale alone won’t resolve the hardest problems. The next phase of progress likely requires architectural advances, better training objectives, and richer evaluation methodology developed specifically for the multimodal setting.

What gets built on top of these advances — the multimodal agents, the embodied systems, the any-to-any creative tools — will be defined as much by these research decisions as by raw capability. That’s why this frontier matters.

Simplify

← Multimodal Voice Agents: Beyond Text Chat With a Microphone

Related reads

multimodal-aimllmcross-modalvision-languageresearchalignmentgroundingemergent-capabilities

Stay ahead of the AI curve

Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.

No spam. Unsubscribe anytime. We respect your inbox.