🦜 The Stochastic Parrot on LLM's Shoulder:
A Summative Assessment of Physical Concept Understanding
1 WeChat AI, 2 HKUST, 3 JHU
= equal contribution
| Rank | Organization | Model | Version | Tags | CORE-Dev | CORE-Test | ASSOC. |
|---|---|---|---|---|---|---|---|
| -- | WeChat AI | Human Performance | -- | -- | -- | 92.0±4.3 | 77.8±6.3 |
| 1 | OpenAI | GPT-4o | gpt-4o-2024-05-13 | TEXT, VISION | 45.2±2.3 | 52.3±0.8 | 39.5±1.1 |
| -- | OpenAI | o1* | o1-preview-2024-09-12 | TEXT | -- | 42.0 | -- |
| 2 | OpenAI | GPT-4 | gpt-4-turbo-2024-04-09 | TEXT | 28.2±2.3 | 41.3±1.3 | 38.3±1.2 |
| 3 | OpenAI | GPT-4 | gpt-4-turbo-2024-04-09 | TEXT, VISION | 28.7±2.4 | 34.2±1.6 | 32.0±1.5 |
| 4 | OpenAI | GPT-4o | gpt-4o-2024-05-13 | TEXT | 31.3±2.9 | 34.0±2.9 | 35.5±2.5 |
| 5 | OpenAI | GPT-3.5 | gpt-3.5-turbo-1106 | TEXT | 24.4±0.8 | 26.5±2.5 | 30.0±2.5 |
| 6 | Shanghai AI Lab | InternVL | 🤗 InternVL-Chat-V1-5 (Chen et al., 2024) | TEXT, VISION, OPEN | 26.9±4.1 | 26.3±1.6 | 24.8±1.3 |
| 7 | UW Madison | LLaVA | 🤗 llava-v1.6-34b (Liu et al., 2024) | TEXT, VISION, OPEN | 28.5±1.5 | 26.2±1.1 | 24.7±3.2 |
| - | -- | Random | -- | -- | 25 | 25 | 25 |
| 8 | Meta AI | LLaMA | 🤗 Meta-Llama-3-8B-Instruct (Meta AI, 2024) | TEXT, OPEN | 27.3±0.6 | 23.5±2.5 | 21.7±2.0 |
| 9 | Mistral AI | Mistral | 🤗 Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) | TEXT, OPEN | 26.0±1.4 | 21.5±0.3 | 23.2±0.4 |