10 Comments

You missed one, which I also missed initially https://o3-failed-arc-agi.vercel.app/

Failed task 8b28cd80

Here, the task had 2 tests, not just 1.

Out of 400 tasks, 19 tasks had 2 tests.
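As a concrete check, here is a minimal sketch that counts the tasks with more than one test pair. It assumes a local checkout of the public ARC-AGI evaluation set, one JSON file per task with "train" and "test" lists; the directory path is an assumption and may need adjusting.

# Minimal sketch: count evaluation tasks with more than one test pair.
# Assumes a local checkout of the ARC-AGI repo; adjust the path as needed.
import json
from pathlib import Path

task_dir = Path("ARC-AGI/data/evaluation")  # assumed local path
task_files = sorted(task_dir.glob("*.json"))
multi_test = [p.stem for p in task_files
              if len(json.loads(p.read_text())["test"]) > 1]
print(f"{len(multi_test)} of {len(task_files)} tasks have more than one test pair")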


Yes! I was wondering why none of the 2-test tasks showed up in my output. Thank you, I'll add it.


Thanks for doing this -- very useful.


Since it's the only task in the entire dataset for which o3 produced a non-attempt, I'd wonder whether a content filter made the second attempt on task da515329 short-circuit.


Are the tasks really in 1D? That would explain a lot. Perhaps allow the models to use image mode; they are already skilled at that.


Yep, they're fed into GPT as long strings of digits separated by newlines (for new rows of the image). Especially with smaller models, this is a big dampener on performance.
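For anyone curious, here is a rough sketch of the kind of serialization described: one grid row per line, each cell as a single digit. The exact prompt format is an assumption, not taken from the post.

# Rough sketch of serializing an ARC grid into text, one row per line.
# The precise prompt formatting used in the experiments is assumed.
def grid_to_text(grid: list[list[int]]) -> str:
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

print(grid_to_text([[0, 7, 7],
                    [7, 7, 0],
                    [0, 7, 0]]))
# 077
# 770
# 070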


Its first guess on the 4th question could also reasonably be seen as correct (it's counting visible rows rather than "implied" rows), though I don't think most people would answer that way. For two more of those (10 and 13), I don't think it "failed" so much as it refused to give the answer, which prominently includes a swastika. :/


Great post! Very interesting.

For the first one that's been going around Twitter:

It's clear there is more than one 'correct' answer. My read was that's why you get two attempts: if your first 'correct' answer is 'incorrect', you're supposed to put forward the next most likely 'correct' answer. In this sense I think o3 fails. Its second response is way off. Instead of switching its answer on the glaring ambiguity (do adjacent squares switch colour?), its solution 'fixes' something which wasn't ambiguous at all.

Of course, my read could be way off itself.


Very interesting! I think it would make more sense to see whether a model can generate a sufficiently general rule from the three examples than to make it draw something.

We just don't know which rule o3 generated, or whether it might be valid, too.

Maybe we also need some thinking on how to measure 'sufficiently general'...


I don't believe in LLMs - they are a waste of big money. Natural intelligence is more efficient, which suggests that a different algorithm is in action.

Consider reading this - https://alexandernaumenko.substack.com/p/algorithm-for-arc-challenge

I am not good at coding; if you think it's a promising approach, I will be happy to share everything else I have so far. If we can finish it together, it may be the breakthrough Chollet is looking for.
