CVPR 2026 Findings

The Unwritten Benchmark

A New Challenge for Multimodal Machine Learning in Abstract Perceptual Reasoning

Garima Arya Yadav, Nilay Yilmaz, Yezhou Yang

Arizona State University

Abstract

Current multimodal models have demonstrated remarkable proficiency in recognizing static visual and auditory content. However, their capacity for abstract perceptual reasoning, that is, inferring unseen information from dynamic, generative processes, remains a critical and underexplored frontier. We introduce The Unwritten Benchmark, a new challenge designed to probe this ability through acousto-kinematic word inference: models must decipher a word as it is being written, in one of three handwriting styles, using only pen-scratch audio and hand motion, with no visible ink trace. Our evaluation reveals a profound gap between human and machine performance. Human participants achieve high ordered letter accuracy, while leading multimodal models struggle to surpass 10%. We also observe a paradoxical fusion effect, where providing both modalities can degrade performance rather than improve it, highlighting fundamental weaknesses in current multimodal reasoning.

Benchmark at a Glance

The benchmark isolates a simple but cognitively demanding question: can a model recover the symbolic output of writing by reasoning over the physical process that produced it?

3 writing styles: Standard, Cursive, Retrace

3 modalities: Audio, Muted Video, Audio+Video

409 letter-level primitive files used as building blocks

10,491 synthetic benchmark files generated for evaluation

80.78% human ordered letter accuracy on muted video

Task Overview

Each sample presents handwriting with no visible ink. The only available clues are the micro-kinematics of the hand and the sound of the pen interacting with paper. Word-level samples are synthesized by concatenating live-recorded letter primitives within a single writing style, enabling scalable generation while preserving the original motion and audio characteristics.

Humans reliably use these signals to infer the written word, but state-of-the-art multimodal models still fail by a wide margin, especially when they must integrate both modalities into a single judgment.
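The concatenation step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `synthesize_word` function, the `(style, letter)` lookup key, and the representation of a clip as a plain list of samples are all assumptions made for the example, and the toy values stand in for real recordings.

```python
from typing import Dict, List, Tuple

Clip = List[float]  # hypothetical stand-in for one recorded letter's audio samples


def synthesize_word(word: str, style: str,
                    primitives: Dict[Tuple[str, str], Clip]) -> Clip:
    """Concatenate letter primitives of a single style, in written order."""
    samples: Clip = []
    for letter in word:
        key = (style, letter)
        if key not in primitives:
            # Styles are never mixed within a word, so a missing primitive is an error.
            raise KeyError(f"no {style!r} primitive for letter {letter!r}")
        samples.extend(primitives[key])
    return samples


# Toy primitives (assumed data): each letter is a short constant signal.
toy_primitives: Dict[Tuple[str, str], Clip] = {
    ("standard", "a"): [0.1] * 3,
    ("standard", "b"): [0.2] * 2,
    ("standard", "o"): [0.3] * 4,
    ("standard", "v"): [0.4] * 2,
    ("standard", "e"): [0.5] * 3,
}

word_audio = synthesize_word("above", "standard", toy_primitives)
print(len(word_audio))  # 14
```

Because each primitive is copied unchanged, the per-letter signal is preserved inside the synthesized word; a real pipeline would apply the same ordering to the matching video segments so that audio and motion stay aligned.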

Selected Samples

The examples below show the benchmark format across styles and modalities. Each row contains a muted video sample, its corresponding audio-only recording, and the synchronized audio-video version.

Standard Style, word sample “above”: Muted Video, Audio Only, Audio + Video

Cursive Style, word sample “above”: Muted Video, Audio Only, Audio + Video

Live Word Recording, natural sample “above”: Muted Video, Audio Only, Audio + Video

Retrace Style, letter sample “a”: Muted Video, Audio Only, Audio + Video

Key Results

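Ordered Letter Accuracy (OLA, in %) as reported in the paper shows a striking human-machine gap and a recurring multimodal fusion failure: every system with an Audio + Video score does worse there than on its best single modality.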

Model	Audio	Muted Video	Audio + Video
Human	19.47	80.78	77.01
Qwen2.5-Omni	7.83	5.03	3.95
GPT-4o	8.87	8.85	-
Gemini 2.5 Pro	10.04	8.71	9.07
Gemini 2.5 Flash	9.49	8.38	8.44

GPT-4o Audio+Video results are omitted in the paper because the evaluation setup did not support pre-merged AV files for that model.
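To make the fusion effect concrete, the short sketch below subtracts each system's best single-modality OLA from its Audio + Video OLA. The scores are copied from the table above; the `fusion_gap` helper and its name are assumptions introduced for this illustration and do not come from the paper.

```python
from typing import Dict, Optional, Tuple

# OLA (%) from the results table: (audio, muted video, audio + video).
# None marks the GPT-4o Audio + Video cell, which the paper omits.
OLA: Dict[str, Tuple[float, float, Optional[float]]] = {
    "Human": (19.47, 80.78, 77.01),
    "Qwen2.5-Omni": (7.83, 5.03, 3.95),
    "GPT-4o": (8.87, 8.85, None),
    "Gemini 2.5 Pro": (10.04, 8.71, 9.07),
    "Gemini 2.5 Flash": (9.49, 8.38, 8.44),
}


def fusion_gap(audio: float, video: float, av: Optional[float]) -> Optional[float]:
    """Audio + Video score minus the best unimodal score, in percentage points."""
    if av is None:
        return None
    return round(av - max(audio, video), 2)


gaps = {name: fusion_gap(*scores) for name, scores in OLA.items()}
for name, gap in gaps.items():
    print(f"{name}: {'n/a' if gap is None else f'{gap:+.2f}'}")
# Human: -3.77
# Qwen2.5-Omni: -3.88
# GPT-4o: n/a
# Gemini 2.5 Pro: -0.97
# Gemini 2.5 Flash: -1.05
```

A negative gap means the second modality did not help beyond the stronger of the two inputs. Note that the best single modality differs by system: muted video for humans, audio for the models.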

Why This Benchmark Matters

The Unwritten Benchmark shifts evaluation away from recognizing explicit content and toward reasoning about a hidden outcome produced by a physical process. It highlights three persistent challenges for current multimodal systems: weak understanding of continuous motion, brittle cross-modal fusion, and limited causal reasoning over tightly synchronized audiovisual signals.

Citation

If you find the dataset or paper useful, please cite:

@inproceedings{yadav2026unwritten,
  title={The Unwritten Benchmark: A New Challenge for Multimodal Machine Learning in Abstract Perceptual Reasoning},
  author={Yadav, Garima Arya and Yilmaz, Nilay and Yang, Yezhou},
  booktitle={CVPR Findings},
  year={2026}
}