Vision AI Checkup
See how 20+ vision-language models perform on dozens of real-world tasks.
Run on 89 prompts.
| Rank | Model                  | Score | Avg Time / Prompt |
|------|------------------------|-------|-------------------|
| #1   | OpenAI O4 Mini         | 79.3% | 25.90s            |
| #2   | ChatGPT-4o             | 78.0% | 13.60s            |
| #3   | OpenAI O3              | 76.8% | 17.77s            |
| #3   | OpenAI o3-pro          | 76.8% | 39.50s            |
| #4   | GPT-4.1 Mini           | 75.6% | 13.60s            |
| #5   | GPT-4.1                | 73.2% | 14.91s            |
| #6   | Gemini 2.5 Pro Preview | 70.7% | 18.25s            |
| #6   | Claude 4 Sonnet        | 70.7% | 13.67s            |
| #7   | Claude 3.7 Sonnet      | 68.3% | 16.52s            |
| #7   | Llama 4 Maverick 17B   | 68.3% | 2.51s             |
Explore Prompts
Explore the prompts we run as part of the Vision AI Checkup.
(P.S.: you can add your own too!)
- Is the glass rim cracked? Answer only yes or no.
- How wide is the sticker in inches? Return only a real number.
- How many bottles are in the image? Answer only a number
- What date is picked on the calendar? Answer like January 1 2020
- How much tax was paid? Only answer like $1.00
- What is the serial number on the tire? Answer only the serial number.
About Vision AI Checkup
Vision AI Checkup measures how well new multimodal models perform on real-world use cases.
Our assessment consists of dozens of images, questions, and reference answers against which we benchmark each model. We rerun the checkup every time we add a new model to the leaderboard.
You can use the Vision AI Checkup to gauge how well a model does in general, without having to interpret a complex benchmark with thousands of data points.
The assessment and the models are constantly evolving: as tasks are added and models receive updates, we build an ever-clearer picture of the current state of the art in real time.
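To make the leaderboard columns concrete: Score is the fraction of prompts a model answers correctly, and Avg Time / Prompt is its mean wall-clock latency. A minimal scoring loop, assuming a hypothetical `query_model` helper like the sketch above and a `cases` list of (prompt, image URL, expected answer) tuples, might look like this:

```python
import time

def run_checkup(model, cases, query_model):
    """Score a model on a list of (prompt, image_url, expected) cases.

    Returns exact-match accuracy and mean seconds per prompt,
    mirroring the Score and Avg Time / Prompt columns above.
    """
    correct, total_seconds = 0, 0.0
    for prompt, image_url, expected in cases:
        start = time.perf_counter()
        answer = query_model(model, prompt, image_url)
        total_seconds += time.perf_counter() - start
        if answer.strip() == expected:  # prompts are built for exact matching
            correct += 1
    return correct / len(cases), total_seconds / len(cases)
```

Passing `query_model` in as a parameter keeps the loop provider-agnostic, so the same sketch covers OpenAI, Anthropic, Google, and Llama models alike.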
Contribute a Prompt
Have an idea for a prompt? Submit it to the project repository on GitHub!