🦜 The Stochastic Parrot on LLM's Shoulder:
A Summative Assessment of Physical Concept Understanding
¹WeChat AI, ²HKUST, ³JHU
= equal contribution
Sorted by Test Average.

| Rank | Organization | Model | Version | Tags | Dev: CORE-Dev | Test: CORE-Test | Test: ASSOC. |
|---|---|---|---|---|---|---|---|
| | | Human | | | 92.0±4.3 | 89.5±5.1 | 77.8±6.3 |
| 1 | OpenAI | o3-mini (high) | o3-mini-2025-01-31 (high) | TEXT THINK | 46.0 | 46.5 | 42.5 |
| 2 | OpenAI | GPT-4o | gpt-4o-2024-05-13 | MULTI-MODAL | 52.3±0.8 | 45.2±2.3 | 39.5±1.1 |
| 3 | DeepSeek | DeepSeek R1 | deepseek-reasoner | TEXT THINK OPEN | 41.5 | 29.5 | 55.0 |
| 4 | Google DeepMind | Gemini-2.0 Flash Thinking | gemini-2.0-flash-thinking-exp | MULTI-MODAL THINK | 49.8±0.8 | 43.2±2.0 | 36.8±3.1 |
| 5 | OpenAI | o1 | o1-2024-12-17 | MULTI-MODAL THINK | 53.0 | 42.5 | 34.5 |
| 6 | OpenAI | GPT-4o | gpt-4o-2024-05-13 | TEXT | 34.0±2.9 | 31.3±2.9 | 35.5±2.5 |
| 7 | OpenAI | GPT-4 | gpt-4-turbo-2024-04-09 | TEXT | 41.3±1.3 | 28.2±2.3 | 38.3±1.2 |
| 8 | OpenAI | GPT-4 | gpt-4-turbo-2024-04-09 | MULTI-MODAL | 34.2±1.6 | 28.7±2.4 | 32.0±1.5 |
| 9 | OpenAI | GPT-3.5 | gpt-3.5-turbo-1106 | TEXT | 26.5±2.5 | 24.4±0.8 | 30.0±2.5 |
| 10 | UW Madison | LLaVA | 🤗 llava-v1.6-34b (Liu et al., 2024) | MULTI-MODAL OPEN | 26.2±1.1 | 28.5±1.5 | 24.7±3.2 |
| 11 | Shanghai AI Lab | InternVL | 🤗 InternVL-Chat-V1-5 (Chen et al., 2024) | MULTI-MODAL OPEN | 26.3±1.6 | 26.9±4.1 | 24.8±1.3 |
| -- | | Random | | | 25 | 25 | 25 |
| 12 | Mistral AI | Mistral | 🤗 Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) | TEXT OPEN | 21.5±0.3 | 26.0±1.4 | 23.2±0.4 |
| 13 | Meta AI | LLaMA 3 | 🤗 Meta-Llama-3-8B-Instruct (Meta AI, 2024) | TEXT OPEN | 23.5±2.5 | 27.3±0.6 | 21.7±2.0 |