| Rank | Organization | Model | Version | Tags | CORE-Dev | CORE-Test | ASSOC. |
|---|---|---|---|---|---|---|---|
| | | Human | | | 92.0±4.3 | 89.5±5.1 | 77.8±6.3 |
| 1 | OpenAI | o3-mini (high) | o3-mini-2025-01-31 (high) | TEXT, THINK | 46.0 | 46.5 | 42.5 |
| 2 | OpenAI | GPT-4o | gpt-4o-2024-05-13 | MULTI-MODAL | 52.3±0.8 | 45.2±2.3 | 39.5±1.1 |
| 3 | DeepSeek | DeepSeek R1 | deepseek-reasoner | TEXT, THINK, OPEN | 41.5 | 29.5 | 55.0 |
| 4 | Google DeepMind | Gemini-2.0 Flash Thinking | gemini-2.0-flash-thinking-exp | MULTI-MODAL, THINK | 49.8±0.8 | 43.2±2.0 | 36.8±3.1 |
| 5 | OpenAI | o1 | o1-2024-12-17 | MULTI-MODAL, THINK | 53.0 | 42.5 | 34.5 |
| 6 | OpenAI | GPT-4o | gpt-4o-2024-05-13 | TEXT | 34.0±2.9 | 31.3±2.9 | 35.5±2.5 |
| 7 | OpenAI | GPT-4 | gpt-4-turbo-2024-04-09 | TEXT | 41.3±1.3 | 28.2±2.3 | 38.3±1.2 |
| 8 | OpenAI | GPT-4 | gpt-4-turbo-2024-04-09 | MULTI-MODAL | 34.2±1.6 | 28.7±2.4 | 32.0±1.5 |
| 9 | OpenAI | GPT-3.5 | gpt-3.5-turbo-1106 | TEXT | 26.5±2.5 | 24.4±0.8 | 30.0±2.5 |
| 10 | UW Madison | LLaVA | | MULTI-MODAL, OPEN | 26.2±1.1 | 28.5±1.5 | 24.7±3.2 |
| 11 | Shanghai AI Lab | InternVL | | MULTI-MODAL, OPEN | 26.3±1.6 | 26.9±4.1 | 24.8±1.3 |
| -- | | Random | | | 25 | | |
| 12 | Mistral AI | Mistral | | TEXT, OPEN | 21.5±0.3 | 26.0±1.4 | 23.2±0.4 |
| 13 | Meta AI | LLaMA 3 | | TEXT, OPEN | 23.5±2.5 | 27.3±0.6 | 21.7±2.0 |