As other comments point out, this is about harness development and harness efficiency. Agentica SDK is a sort of meta-harness that makes things easy: plug any "internal API" (as defined natively in your codebase) directly into your agent. Agentica SDK itself is not application-specific; but the APIs of your application are... application-specific.
Re: the linked prompt. A harness is a set of tools, descriptions of how to best use those tools, and sometimes some external control flow based on the outcome of using those tools. How to "best use the tools" should always be part of the prompt (like in this case).
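To make that three-part definition concrete, here is a toy sketch (not Agentica's actual API; all names are hypothetical): tools, their usage descriptions folded into the prompt, and a bit of external control flow that routes the model's tool calls.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str  # "how to best use it" -- this text goes into the prompt
    fn: Callable[[str], str]

def build_system_prompt(tools: list[Tool]) -> str:
    # Part 2 of the harness: tool descriptions become prompt text.
    lines = ["You can call these tools:"]
    for t in tools:
        lines.append(f"- {t.name}: {t.description}")
    return "\n".join(lines)

def dispatch(tools: list[Tool], call: str) -> str:
    # Part 3: external control flow -- route a model-emitted call like
    # "move:UP" to the matching tool and hand the result back.
    name, _, arg = call.partition(":")
    for t in tools:
        if t.name == name:
            return t.fn(arg)
    return f"unknown tool {name!r}"

tools = [Tool("move", "Send one of UP/DOWN/LEFT/RIGHT to the game.",
              lambda a: f"moved {a}")]
prompt = build_system_prompt(tools)
result = dispatch(tools, "move:UP")
```

The point of the sketch is only that the tool *descriptions* are prompt content, which is why "how to best use the tools" belongs in the prompt rather than outside the harness.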
So this work tries to answer: "short of telling the agent any solutions, make a simple but efficient API to play the games, hand it to the agent, and see how it does". In the world of harness development I think that's an interesting question to answer!
The challenge isn't about harness development though, and a sufficiently complex harness can solve these tasks rather easily.
And presenting it as if you've made a novel development for solving ARC-AGI-3 leads me to believe you're willing to waste all of our time for your benefit at every step in the future.
The hard part of these tests isn't purely reasoning ability ffs.
This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.
That said, their harness isn't generic. It includes a ridiculously detailed prompt for how to play this specific game. Forbidding tool use is arbitrary and above all pointless hoop jumping, but that doesn't make the linked "achievement" any less fraudulent.
The wrench is to the mechanic as the stock python repl is to the LLM.
In fact the LLMs that everyone uses today typically have access to specialized task specific tooling. Obviously specialized tools aren't appropriate for a test that measures the ability to generalize but generic tools are par for the course. Writing a bot to play a game for you would certainly serve to demonstrate an understanding of the task.
EDIT from https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf:
> We seek to fight two forms of overfitting that would muddy public sensemaking:
> Task-specific overfitting. This includes any agent that is created with knowledge of public ARC-AGI-3 environments, subsequently being evaluated on the same environments. It could be either directly trained on these environments, or using a harness that is handcrafted or specifically configured by someone with knowledge of the public environments.
Not sure if the specific rules of this prize allow it, but I would accept that.
Anyway, searching both in ARC-AGI's paper and website and directly on kaggle, I failed to find a with-harness leaderboard; can you please give the link?
What if you give opus the same harness? Do people even care about meaningful comparisons any more or is it all just “numbers go up”
Yes, it's unfair to compare results for the 25 (easier) public games against scores for the 55 semi-private games (scores for which are taken from https://arcprize.org/leaderboard).
But you're wrong to say that a custom harness invalidates the result. Yes, the official "ARC verified" scoreboard for frontier LLMs requires (https://arcprize.org/policy):
> using extremely generic and minimal LLM testing prompts, no client-side "harnesses", no hand-crafted tools, and no tailored model configuration
but these are limitations placed in order to compare LLMs from frontier labs on equal footing, not limitations that apply to submissions in general. It's not as if a solution to ARC-AGI-3 must involve training a custom LLM! This Agentica harness is a completely legitimate approach to ARC-AGI-3, similar to J. Berman's for ARC-AGI-1/2, for example.
> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
This is the state of "AI" these days I guess...
[1] https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
Here is the ARC-AGI-3 specific harness by the way - lots of challenge information encoded inside: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
I wish we'd move past public test sets for LLM benchmarks: publish a plain English explanation of the tasks, allow questions and clarifications, but never release a single question from the test set verbatim.
It made sense back when models needed to be finetuned on the task to even reliably answer. If we're saying this is the path to AGI we should be able to rely on the generalization of the model to get it right.
If a model was trained on <|begin_text|> <|end_text|> and you change the tokens passed to <|start_text|> <|end_text|>, it loses several 'IQ points', if it can even answer at all anymore.
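A toy illustration of why a renamed marker hurts (this is not any real tokenizer, just a minimal sketch of the mechanism): a special token the model was trained on maps to a single learned ID, while a renamed marker fails to match and falls apart into ordinary sub-pieces the model never saw in that position.

```python
# Hypothetical single-special-token tokenizer.
SPECIALS = {"<|begin_text|>": 1}

def toy_tokenize(text: str) -> list:
    # Known special marker -> one reserved ID, rest tokenized normally
    # (character-level here for simplicity).
    for marker, tok_id in SPECIALS.items():
        if text.startswith(marker):
            return [tok_id] + list(text[len(marker):])
    # Unknown marker: no match, so the '<|start_text|>' bytes get split
    # into ordinary pieces -- a sequence the model was never trained on.
    return list(text)

known = toy_tokenize("<|begin_text|>hi")    # [1, 'h', 'i']
renamed = toy_tokenize("<|start_text|>hi")  # ['<', '|', 's', ...]
```

The renamed prompt isn't "the same question worded differently" from the model's point of view; it's a structurally different input, which is the sense in which surface-form changes can cost measurable accuracy.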
Synthetic data is fine. Synthetic data on very similar questions generated based on the description is typically fine. But once the shape of what you're training on gets too close to the actual holdout questions, you're getting an uplift that's not realistic for unseen tasks.
According to the authors the harness isn't ARC-AGI specific though https://x.com/agenticasdk/status/2037335806264971461