🦜 The Stochastic Parrot on LLM's Shoulder:
A Summative Assessment of Physical Concept Understanding
¹ WeChat AI, ² HKUST, ³ JHU
= equal contribution
| Rank | Organization | Model | Version | Tags | CORE-Dev | CORE-Test | ASSOC. |
|---|---|---|---|---|---|---|---|
| -- | WeChat AI | Human Performance | | | 92.0±4.3 | 89.5±5.1 | 77.8±6.3 |
| 1 | OpenAI | GPT-4o | gpt-4o-2024-05-13 | TEXT, VISION | 52.3±0.8 | 45.2±2.3 | 39.5±1.1 |
| -- | OpenAI | o1* | o1-preview-2024-09-12 | TEXT | 42.0 | -- | -- |
| 2 | OpenAI | GPT-4 | gpt-4-turbo-2024-04-09 | TEXT | 41.3±1.3 | 28.2±2.3 | 38.3±1.2 |
| 3 | OpenAI | GPT-4 | gpt-4-turbo-2024-04-09 | TEXT, VISION | 34.2±1.6 | 28.7±2.4 | 32.0±1.5 |
| 4 | OpenAI | GPT-4o | gpt-4o-2024-05-13 | TEXT | 34.0±2.9 | 31.3±2.9 | 35.5±2.5 |
| 5 | OpenAI | GPT-3.5 | gpt-3.5-turbo-1106 | TEXT | 26.5±2.5 | 24.4±0.8 | 30.0±2.5 |
| 6 | Shanghai AI Lab | InternVL | 🤗 InternVL-Chat-V1-5 (Chen et al., 2024) | TEXT, VISION, OPEN | 26.3±1.6 | 26.9±4.1 | 24.8±1.3 |
| 7 | UW Madison | LLaVA | 🤗 llava-v1.6-34b (Liu et al., 2024) | TEXT, VISION, OPEN | 26.2±1.1 | 28.5±1.5 | 24.7±3.2 |
| - | | Random: 25 | | | | | |
| 8 | Meta AI | LLaMA | 🤗 Meta-Llama-3-8B-Instruct (Meta AI, 2024) | TEXT, OPEN | 23.5±2.5 | 27.3±0.6 | 21.7±2.0 |
| 9 | Mistral AI | Mistral | 🤗 Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) | TEXT, OPEN | 21.5±0.3 | 26.0±1.4 | 23.2±0.4 |