🦜 The Stochastic Parrot on LLM's Shoulder:
A Summative Assessment of Physical Concept Understanding
1 WeChat AI, 2 HKUST, 3 JHU
= equal contribution
[Paper] [Leaderboard]
Illustration of a "Stochastic Parrot 🦜" exposed by our PhysiCo task, which consists of low-level and high-level subtasks run in parallel. For the concept Gravity, an LLM can generate an accurate description in natural language but cannot interpret the concept's grid-format illustration.

Abstract

We systematically investigate a widely asked question, "Do LLMs really understand what they say?", which relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue by using grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon and application examples to analogies with other abstract patterns in the grid world. A comprehensive study on our task demonstrates that:

  1. state-of-the-art LLMs lag behind humans by ~40%;
  2. the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language;
  3. our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on data in the same format added little to their performance.
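
To make the idea of a grid-format input more concrete, here is a minimal Python sketch of how a concept such as Gravity might be depicted as a pair of grids. This is illustrative only: the page does not specify the actual PhysiCo encoding, so the integer cell values, the input/output pairing, and the helper names `apply_gravity` and `render` are all assumptions rather than the authors' format.

```python
from typing import List

# Hypothetical encoding: 0 is empty space, any other integer is an object.
Grid = List[List[int]]


def apply_gravity(grid: Grid) -> Grid:
    """Let every non-zero cell fall straight down until it rests on the
    floor or on another cell (an assumed toy depiction of Gravity)."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        # Collect this column's objects from top to bottom.
        column = [grid[r][c] for r in range(rows) if grid[r][c] != 0]
        # Stack them at the bottom, preserving their vertical order.
        start = rows - len(column)
        for i, value in enumerate(column):
            out[start + i][c] = value
    return out


def render(grid: Grid) -> str:
    """Serialize a grid as plain text, one row per line."""
    return "\n".join(" ".join(str(v) for v in row) for row in grid)


before = [
    [0, 3, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
after = apply_gravity(before)

print("Input grid:\n" + render(before))
print("Output grid:\n" + render(after))
```

A model that can define gravity in words would, under this toy setup, also need to recognize that the transformation from the input grid to the output grid depicts objects falling, without any verbal cue naming the concept.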