You missed one, which I also missed initially: https://o3-failed-arc-agi.vercel.app/
Failed task 8b28cd80
Here, the task had 2 tests, not just 1.
Out of 400 tasks, 19 tasks had 2 tests.
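For anyone who wants to check: a minimal sketch of counting the multi-test tasks, assuming the public ARC-AGI repo layout (one JSON file per task, each holding "train" and "test" lists; the directory path here is a guess):

```python
import json
from pathlib import Path

# Hypothetical path: point this at the ARC-AGI repo's evaluation set.
EVAL_DIR = Path("ARC-AGI/data/evaluation")

multi_test = []
for task_file in sorted(EVAL_DIR.glob("*.json")):
    task = json.loads(task_file.read_text())
    # Each task JSON holds "train" and "test" lists of {"input", "output"} pairs.
    if len(task["test"]) > 1:
        multi_test.append(task_file.stem)

print(f"{len(multi_test)} tasks have more than one test:", multi_test)
```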
Yes! I was wondering why none of the 2-test tasks showed up in my output. Thank you, I'll add it.
Thanks for doing this -- very useful.
Since Task da515329 is the only task in the entire dataset that o3 produced a non-attempt for, I'd wonder whether a content filter made the second attempt short-circuit.
Are the tasks really in 1D? That would explain a lot. Perhaps allow the models to use image mode; they are already skilled at that.
Yep, they're fed into GPT as long strings of digits, with newlines separating the rows of the image. Especially with smaller models, this is a big damper on performance.
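For illustration, a minimal sketch of that serialization; the exact prompt format used in the actual runs may differ:

```python
def grid_to_prompt(grid):
    """Serialize an ARC grid (a list of rows of ints 0-9) into a
    newline-separated digit string, as described above. A sketch only;
    the real evaluation prompts may format grids differently."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

print(grid_to_prompt([[0, 0, 7],
                      [0, 7, 0],
                      [7, 0, 0]]))
# 007
# 070
# 700
```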
Its first guess on the 4th question could also reasonably be seen as correct (it's counting visible rows rather than "implied" rows), though I don't think most people would answer that way. For two more of those (10 and 13), I don't think it "failed" so much as it refused to give the answer that prominently includes a swastika. :/
Great post! Very interesting.
For the first one that's been going around Twitter:
It's clear there is more than one 'correct' answer. My read was that's why you get two attempts: if your first 'correct' answer is 'incorrect', you're supposed to then put forward the next most likely 'correct' answer. In this sense I think o3 fails (I sketch this reading in code below). Its second response is way off. Instead of addressing the glaring ambiguity (do adjacent squares switch colour?), its solution 'fixes' something which wasn't ambiguous at all.
Of course, my read could be way off itself
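In code, my reading of the scoring amounts to this (a sketch of my interpretation, not the official harness):

```python
def task_solved(attempts, truth):
    # My reading of the rules: a test input counts as solved if either of
    # the (up to) two submitted output grids matches the ground-truth grid
    # exactly; there is no partial credit.
    return any(attempt == truth for attempt in attempts[:2])
```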
Very interesting! I think it would make more sense to see whether a model can generate a sufficiently general rule from the three examples than to make it draw something.
We just don't know which rule o3 generated, or whether it might be valid, too.
Maybe we also need some thinking on how to measure 'sufficiently general' ...
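One crude way to operationalize this: have the model emit its rule as a program and check it against the training pairs. A sketch with hypothetical names; note it only tests consistency, which is necessary but not sufficient for generality:

```python
def rule_is_consistent(rule, train_pairs):
    """True if a candidate rule (a function from input grid to output grid)
    reproduces every training example. Measuring whether a consistent rule
    is also 'sufficiently general' remains the open question."""
    return all(rule(pair["input"]) == pair["output"] for pair in train_pairs)

# Hypothetical candidate rule: mirror each row left-to-right.
mirror = lambda grid: [row[::-1] for row in grid]
print(rule_is_consistent(mirror, [{"input": [[1, 0]], "output": [[0, 1]]}]))  # True
```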
I don't believe in LLMs - they are a waste of big money. Natural intelligence is more efficient, which suggests that a different algorithm is at work.
Consider reading this - https://alexandernaumenko.substack.com/p/algorithm-for-arc-challenge
I am not good at coding. If you think it's a promising approach, I will be happy to share everything else I have so far. If we can finish it together, it may be the breakthrough Chollet is looking for.