4 Comments

I love this point "giving 2D ARC tasks to an LLM is like expecting humans to perform reasoning in 4D". I wonder (not really) what human performance on ARC would be if they didn't see the puzzle as a 2d picture, but as a sequence of numbers or a sequence of 1d pictures.

I could also easily see humans failing on hypothetical 3D ARC tasks if the representation weren't convenient enough.


There's a simpler explanation: larger problems require a larger program search space. Adaptive computation is an unsolved problem, and LLMs are no exception, so the model is limited by how well it can narrow down candidate solutions in CoT and act on them effectively.

o3-style models are "significantly better" at this by expending an exponential amount of compute per problem, because they aren't "solving" it per se but rather searching for and guessing a set of heuristics that correctly solves the task at hand.

So no, perception is not the bottleneck here. It's the lack of an ACT-style mechanism. ARC is just the simplest way to encode this family of problems, which happens to be spatial in nature.
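A toy illustration of the search-space argument (my own sketch with a made-up primitive set, not anything from the post): brute-force program search over k primitives has k**depth candidates, so the compute needed blows up with the program length a task requires unless the search can be narrowed adaptively.

```python
# Hedged sketch: enumerate compositions of a few hypothetical grid primitives
# and test them against one input/output example. The candidate count is
# len(PRIMITIVES) ** depth, which grows exponentially with program length.
from itertools import product

PRIMITIVES = {
    "identity":  lambda g: g,
    "flip_h":    lambda g: [row[::-1] for row in g],
    "flip_v":    lambda g: g[::-1],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def search(example_in, example_out, max_depth=3):
    """Return the first primitive sequence that maps example_in to example_out."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):  # |P|**depth candidates
            g = example_in
            for name in names:
                g = PRIMITIVES[name](g)
            if g == example_out:
                return names
    return None

print(search([[1, 2], [3, 4]], [[2, 1], [4, 3]]))  # ('flip_h',)
```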


One thing I'd be curious about is how changing the way ARC problems are presented affects the solution curves. I'd guess that if one found a better format for the data (potentially as simple as allowing for both column-major and row-major presentations), the accuracy curves would follow the same shape as you've described here.
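For example (an illustrative sketch with an arbitrary grid, not tied to the article's setup), here are the two presentations of the same grid:

```python
# The same grid serialized row-major versus column-major.
grid = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 2, 2],
]

row_major = [cell for row in grid for cell in row]
col_major = [cell for col in zip(*grid) for cell in col]

print(row_major)  # [1, 0, 0, 1, 0, 0, 1, 2, 2]
print(col_major)  # [1, 1, 1, 0, 0, 2, 0, 0, 2]
```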

I don't think it's particularly important, but it would be interesting (and might say something about scaffolding?).


Not even wrong.
