    A new, challenging AGI test stumps most AI models


    The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Tuesday that it has created a new, challenging test to measure the general intelligence of leading AI models.

    So far, the new test, called ARC-AGI-2, has stumped most models.

    “Reasoning” AI models like OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash score around 1%.

    The ARC-AGI tests consist of puzzle-like problems where an AI has to identify visual patterns from a collection of different-colored squares and generate the correct “answer” grid. The problems were designed to force an AI to adapt to new problems it hasn’t seen before.
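
    As a rough illustration of that puzzle format, the sketch below models a toy task in Python. The task, the "mirror each row" rule, and the helper names are all hypothetical and are not taken from ARC-AGI-2; the assumption is only that a task pairs a few worked input/output grids with a held-out test grid, and that each integer stands for one cell color.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # each integer stands for one cell color

# Hypothetical toy task: the hidden rule is "mirror each row left to right".
task: Dict[str, List[Dict[str, Grid]]] = {
    "train": [
        {"input": [[1, 0, 0], [2, 2, 0]], "output": [[0, 0, 1], [0, 2, 2]]},
        {"input": [[3, 0], [0, 4]], "output": [[0, 3], [4, 0]]},
    ],
    "test": [{"input": [[5, 0, 0, 6]], "output": [[6, 0, 0, 5]]}],
}


def mirror(grid: Grid) -> Grid:
    """Candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def fits_examples(rule: Callable[[Grid], Grid], examples: List[Dict[str, Grid]]) -> bool:
    """Check that a candidate rule reproduces every worked example exactly."""
    return all(rule(ex["input"]) == ex["output"] for ex in examples)


def score(rule: Callable[[Grid], Grid], task: Dict[str, List[Dict[str, Grid]]]) -> float:
    """Fraction of held-out test grids the rule gets exactly right."""
    tests = task["test"]
    correct = sum(rule(t["input"]) == t["output"] for t in tests)
    return correct / len(tests)


if __name__ == "__main__":
    print(fits_examples(mirror, task["train"]))
    print(score(mirror, task))
```

    A solver has to infer the rule from the worked examples alone. In this toy version, an answer grid counts only if it matches the expected grid cell for cell.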

    The Arc Prize Foundation had over 400 people take ARC-AGI-2 to establish a human baseline. On average, “panels” of these people got 60% of the test’s questions right, much better than any of the models’ scores.

    A sample question from ARC-AGI-2 (credit: Arc Prize).

    In a post on X, Chollet claimed ARC-AGI-2 is a better measure of an AI model’s actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation’s tests are aimed at evaluating whether an AI system can efficiently acquire new skills outside the data it was trained on.

    Chollet said that, unlike ARC-AGI-1, the new test prevents AI models from relying on “brute force” (extensive computing power) to find solutions. Chollet previously acknowledged this was a major flaw of ARC-AGI-1.

    To address the first test’s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.

    “Intelligence is not solely defined by the ability to solve problems or achieve high scores,” Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”

    ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3’s performance gains on ARC-AGI-1 came with a hefty price tag.

    The version of OpenAI’s o3 model, o3 (low), that was first to reach new heights on ARC-AGI-1, scoring 75.7% on the test, got a measly 4% on ARC-AGI-2 using $200 worth of computing power per task.

    Comparison of frontier AI model performance on ARC-AGI-1 and ARC-AGI-2 (credit: Arc Prize).

    The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face’s co-founder, Thomas Wolf, recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.

    Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.


