CVPR 2026 Findings
The Unwritten Benchmark
A New Challenge for Multimodal Machine Learning in Abstract Perceptual Reasoning
Arizona State University
Abstract
Current multimodal models have demonstrated remarkable proficiency in recognizing static visual and auditory content. However, their capacity for abstract perceptual reasoning, that is, inferring unseen information from dynamic, generative processes, remains a critical and underexplored frontier. We introduce The Unwritten Benchmark, a new challenge designed to probe this ability through acousto-kinematic word inference: across three handwriting styles, models must decipher words as they are written, using only pen-scratch audio and hand motion, without any visible ink trace. Our evaluation reveals a profound gap between human and machine performance. Human participants achieve high ordered letter accuracy (about 81% from muted video in the results reported here), while leading multimodal models struggle to surpass 10%. We also observe a paradoxical fusion effect, where providing both modalities can degrade performance rather than improve it, highlighting fundamental weaknesses in current multimodal reasoning.
Benchmark at a Glance
The benchmark isolates a simple but cognitively demanding question: can a model recover the symbolic output of writing by reasoning over the physical process that produced it?
Writing styles: Standard, Cursive, Retrace
Input modalities: Audio, Muted Video, Audio+Video
Task Overview
Each sample presents handwriting with no visible ink. The only available clues are the micro-kinematics of the hand and the sound of the pen interacting with paper. Word-level samples are synthesized by concatenating live-recorded letter primitives within a single writing style, enabling scalable generation while preserving the original motion and audio characteristics.
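The concatenation step described above can be pictured with a minimal sketch. It is not the authors' pipeline: the `Primitive` class, the `synthesize_word` function, and the toy bank are hypothetical names introduced for illustration, and real primitives would hold video frames and audio waveforms rather than short placeholder lists.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Primitive:
    """One live-recorded letter (hypothetical container for this sketch)."""
    frames: List[str]    # placeholder frame identifiers, in playback order
    audio: List[float]   # placeholder mono audio samples


def synthesize_word(word: str, bank: Dict[str, Primitive]) -> Primitive:
    """Concatenate per-letter primitives from a single style's bank, in letter order."""
    frames: List[str] = []
    audio: List[float] = []
    for letter in word:
        prim = bank[letter]          # raises KeyError if the bank lacks this letter
        frames.extend(prim.frames)   # motion is appended unchanged
        audio.extend(prim.audio)     # audio is appended unchanged, paired with its own frames
    return Primitive(frames, audio)


# Toy bank standing in for one writing style (assumed data, not from the benchmark)
cursive_bank = {
    "a": Primitive(["a0", "a1"], [0.1, 0.2]),
    "b": Primitive(["b0"], [0.3]),
}

sample = synthesize_word("ab", cursive_bank)
print(sample.frames)
print(len(sample.audio))
```

Keeping one bank per style mirrors the constraint that primitives are only combined within a single writing style.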
Selected Samples
The examples below show the benchmark format across styles and modalities. Each row contains a muted video sample, its corresponding audio-only recording, and the synchronized audio-video version.
Cursive Style
Word sample: “above”
Muted Video
Audio Only
Audio + Video
Live Word Recording
Natural sample: “above”
Muted Video
Audio Only
Audio + Video
Retrace Style
Letter sample: “a”
Muted Video
Audio Only
Audio + Video
Key Results
Ordered Letter Accuracy (OLA, in %), as reported in the paper, shows a striking human-machine gap and a recurring multimodal fusion failure.
| Model | Audio | Muted Video | Audio + Video |
|---|---|---|---|
| Human | 19.47 | 80.78 | 77.01 |
| Qwen2.5-Omni | 7.83 | 5.03 | 3.95 |
| GPT-4o | 8.87 | 8.85 | - |
| Gemini 2.5 Pro | 10.04 | 8.71 | 9.07 |
| Gemini 2.5 Flash | 9.49 | 8.38 | 8.44 |
GPT-4o Audio+Video results are omitted in the paper because the evaluation setup did not support pre-merged AV files for that model.
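One way to make the fusion effect concrete is to subtract each system's best single-modality score from its Audio + Video score. The sketch below does this with the values from the table above; the name `fusion_gap` and this particular way of quantifying the effect are assumptions for illustration and not a metric defined in the paper.

```python
from typing import Dict, Optional, Tuple

# OLA (%) copied from the results table: (audio, muted video, audio + video)
RESULTS: Dict[str, Tuple[float, float, Optional[float]]] = {
    "Human": (19.47, 80.78, 77.01),
    "Qwen2.5-Omni": (7.83, 5.03, 3.95),
    "GPT-4o": (8.87, 8.85, None),          # no Audio + Video result reported
    "Gemini 2.5 Pro": (10.04, 8.71, 9.07),
    "Gemini 2.5 Flash": (9.49, 8.38, 8.44),
}


def fusion_gap(audio: float, video: float, av: Optional[float]) -> Optional[float]:
    """Audio + Video score minus the best single-modality score (negative = fusion hurts)."""
    if av is None:
        return None
    return round(av - max(audio, video), 2)


gaps = {name: fusion_gap(*row) for name, row in RESULTS.items()}
for name, gap in gaps.items():
    print(f"{name}: {'n/a' if gap is None else gap}")
```

By this measure every system with an Audio + Video score, humans included, falls below its best single modality. For humans and the Gemini models the fused score lands between the two single-modality scores, while for Qwen2.5-Omni it falls below both.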
Why This Benchmark Matters
The Unwritten Benchmark shifts evaluation away from recognizing explicit content and toward reasoning about a hidden outcome produced by a physical process. It highlights three persistent challenges for current multimodal systems: weak understanding of continuous motion, brittle cross-modal fusion, and limited causal reasoning over tightly synchronized audiovisual signals.
Citation
If you find the dataset or paper useful, please cite: