Yesterday, OpenAI announced their new o3 model. One of the big claims was reaching the very high performance of 76% on ARC-AGI-SemiPub, and as high as 91% on the public set (after spending $1.6 million on compute just to run inference on this one benchmark, roughly $3,000 per task, in addition to fine-tuning costs).¹
Since writing this post, I’ve published an article deep-diving into the advances o3 made, its mistakes, and how ARC may be a misleading metric for LLM reasoning:
So, of the 400 tasks in ARC-AGI-Pub, here are the 34 tasks that o3 was unable to solve, along with the incorrect guesses it made:
You can check out the data for yourself here, with visualisations made with my arckit library. I’d love to hear your thoughts!
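If you’d rather poke at the raw tasks yourself than rely on my rendered images, here’s a rough sketch of how to plot one. Each task JSON is just a dict with "train" and "test" lists of {"input": grid, "output": grid} pairs, where each grid is a list of lists of integers 0–9. The sketch uses plain matplotlib rather than arckit, and it assumes a local checkout of the public ARC-AGI repository; the draw_task helper and the file path are purely illustrative.

```python
import json
import matplotlib.pyplot as plt
from matplotlib import colors

# The ten ARC colours (cell values 0-9), matching the official testing interface.
ARC_CMAP = colors.ListedColormap([
    "#000000", "#0074D9", "#FF4136", "#2ECC40", "#FFDC00",
    "#AAAAAA", "#F012BE", "#FF851B", "#7FDBFF", "#870C25",
])
ARC_NORM = colors.Normalize(vmin=0, vmax=9)

def draw_task(path):
    """Plot every input/output pair of one ARC task, one grid per panel."""
    with open(path) as f:
        task = json.load(f)
    pairs = [(p, "train") for p in task["train"]] + [(p, "test") for p in task["test"]]
    fig, axes = plt.subplots(2, len(pairs), figsize=(2.2 * len(pairs), 4.5))
    for col, (pair, split) in enumerate(pairs):
        for row, key in enumerate(("input", "output")):
            ax = axes[row][col]
            ax.imshow(pair[key], cmap=ARC_CMAP, norm=ARC_NORM)
            ax.set_xticks([])
            ax.set_yticks([])
            ax.set_title(f"{split} {key}", fontsize=8)
    plt.tight_layout()
    plt.show()

# e.g. the task mentioned in the comments below (path assumes a local ARC-AGI checkout):
# draw_task("ARC-AGI/data/evaluation/8b28cd80.json")
```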
As an interesting aside, here’s an example of how the first generation of LLMs did on ARC (from our previous work, Neural Networks for Abstraction & Reasoning). As model scale increases, we see a steadily improving ability to avoid mistakes, even if some of the reasoning is there from the beginning.
The ARC organisers impose an inference limit of $10,000 on the Semi-Public set, which explains most of the lower score. It’s worth noting that neither of these sets represents the actual ARC Prize, which requires much more restricted compute.
You missed one, which I also missed initially: https://o3-failed-arc-agi.vercel.app/
Failed task 8b28cd80
Here, the task had 2 tests, not just 1
Out of 400 tasks, 19 tasks had 2 tests.
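(A quick way to check this against the public evaluation set, assuming a local checkout of the ARC-AGI repo; the path is illustrative:)

```python
import glob
import json
from collections import Counter

# Group the 400 public evaluation tasks by how many test pairs they contain.
counts = Counter()
for path in glob.glob("ARC-AGI/data/evaluation/*.json"):
    with open(path) as f:
        counts[len(json.load(f)["test"])] += 1

print(counts)  # per the comment above, 19 of the 400 tasks have 2 tests
```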
Thanks for doing this -- very useful.