10 Comments

You missed one, which I also missed initially https://o3-failed-arc-agi.vercel.app/

Failed task 8b28cd80

Here, the task had 2 tests, not just 1.

Out of 400 tasks, 19 tasks had 2 tests.
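As a concrete check, here is a minimal sketch that counts the tasks with more than one test pair. It assumes a local checkout of the public ARC-AGI evaluation set, one JSON file per task with "train" and "test" lists; the directory path is an assumption and may need adjusting.

# Minimal sketch: count evaluation tasks with more than one test pair.
# Assumes a local checkout of the ARC-AGI repo; adjust the path as needed.
import json
from pathlib import Path

task_dir = Path("ARC-AGI/data/evaluation")  # assumed local path
task_files = sorted(task_dir.glob("*.json"))
multi_test = [p.stem for p in task_files
              if len(json.loads(p.read_text())["test"]) > 1]
print(f"{len(multi_test)} of {len(task_files)} tasks have more than one test pair")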


Yes! I was wondering why none of the 2-test tasks showed up in my output. Thank you, I'll add it.


Thanks for doing this -- very useful.


Since it's the only task in the entire dataset for which o3 produced a non-attempt, I'd wonder whether a content filter made the second attempt on task da515329 short-circuit.


Are the tasks really in 1D? That would explain a lot. Perhaps allow the models to use image mode; they are already skilled at that.


Yep, they're fed into GPT as long strings of digits separated by newlines (for new rows of the image). Especially with smaller models, this is a big dampener on performance.
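For anyone curious, here is a rough sketch of the kind of serialization described: one grid row per line, each cell as a single digit. The exact prompt format is an assumption, not taken from the post.

# Rough sketch of serializing an ARC grid into text, one row per line.
# The precise prompt formatting used in the experiments is assumed.
def grid_to_text(grid: list[list[int]]) -> str:
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

print(grid_to_text([[0, 7, 7],
                    [7, 7, 0],
                    [0, 7, 0]]))
# 077
# 770
# 070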


Its first guess on the 4th question could also reasonably be seen as correct (it's counting visible rows rather than "implied" rows), though I don't think most people would answer that way. For two more of those (10 and 13), I don't think it "failed" so much as it refused to give the answer, which prominently includes a swastika. :/


Great post! Very interesting.

For the first one that's been going around Twitter:

It's clear there is more than one 'correct' answer. My read was that's why you get two attempts: if your first 'correct' answer is 'incorrect', you're supposed to put forward the next most likely 'correct' answer. In this sense I think o3 fails. Its second response is way off. Instead of switching its answer on the glaring ambiguity (do adjacent squares switch colour?), its solution 'fixes' something which wasn't ambiguous at all.

Of course, my read could be way off itself.


Very interesting! I think it would make more sense to see whether a model can generate a sufficiently general rule from the three examples than to make it draw something.

We just don't know which rule o3 generated, or whether it might be valid, too.

Maybe we also need some thinking on how to measure 'sufficiently general'...


I don't believe in LLMs - they are a waste of big money. Natural intelligence is more efficient, which suggests that a different algorithm is in action.

Consider reading this - https://alexandernaumenko.substack.com/p/algorithm-for-arc-challenge

I am not good at coding; if you think it's a promising approach, I will be happy to share everything else I have so far. If we can finish it together, it may be the breakthrough Chollet is looking for.
