Rank | Organization | Model | CORE-Dev | CORE-Test | ASSOC.
---|---|---|---|---|---
-- | WeChat AI | Human Performance | 92.0±4.3 | 89.5±5.1 | 77.8±6.3
1 | OpenAI | GPT-4o (gpt-4o-2024-05-13) TEXT VISION | 52.3±0.8 | 45.2±2.3 | 39.5±1.1
-- | OpenAI | o1* (o1-preview-2024-09-12) TEXT | 42.0 | -- | --
2 | OpenAI | GPT-4 (gpt-4-turbo-2024-04-09) TEXT | 41.3±1.3 | 28.2±2.3 | 38.3±1.2
3 | OpenAI | GPT-4 (gpt-4-turbo-2024-04-09) TEXT VISION | 34.2±1.6 | 28.7±2.4 | 32.0±1.5
4 | OpenAI | GPT-4o (gpt-4o-2024-05-13) TEXT | 34.0±2.9 | 31.3±2.9 | 35.5±2.5
5 | OpenAI | GPT-3.5 (gpt-3.5-turbo-1106) TEXT | 26.5±2.5 | 24.4±0.8 | 30.0±2.5
6 | Shanghai AI Lab | InternVL TEXT VISION OPEN | 26.3±1.6 | 26.9±4.1 | 24.8±1.3
7 | UW Madison | LLaVA TEXT VISION OPEN | 26.2±1.1 | 28.5±1.5 | 24.7±3.2
-- | | Random | 25 | |
8 | Meta AI | LLaMA TEXT OPEN | 23.5±2.5 | 27.3±0.6 | 21.7±2.0
9 | Mistral AI | Mistral TEXT OPEN | 21.5±0.3 | 26.0±1.4 | 23.2±0.4