
    Did xAI lie about Grok 3’s benchmarks?


    Debates over AI benchmarks — and the way they’re reported by AI labs — are spilling out into public view.

    This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its newest AI model, Grok 3. One of xAI's co-founders, Igor Babushkin, insisted that the company was in the right.

    The truth lies somewhere in between.

    In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.

    xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."

    What is cons@64, you might ask? Well, it's short for "consensus@64," and it essentially gives a model 64 tries to answer each problem in a benchmark and takes the answers generated most frequently as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph can make it appear as if one model surpasses another when in reality that isn't the case.
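    To make the voting mechanic concrete, here is a minimal sketch of consensus@k scoring. The function name and data layout are hypothetical (labs implement this internally however they like), and the toy example uses k = 4 attempts per problem instead of 64 for brevity:

    ```python
    from collections import Counter

    def cons_at_k(answers_per_problem, correct_answers):
        """Score a benchmark run with consensus@k (majority vote).

        answers_per_problem: for each problem, the list of k answers the
        model sampled for that problem.
        correct_answers: the reference answer for each problem.
        Returns the fraction of problems where the majority answer is right.
        """
        correct = 0
        for samples, truth in zip(answers_per_problem, correct_answers):
            # The consensus answer is whichever answer appears most often
            # across the k sampled attempts.
            consensus, _ = Counter(samples).most_common(1)[0]
            if consensus == truth:
                correct += 1
        return correct / len(correct_answers)

    # Toy run with k = 4: on the first problem, the model is wrong on one
    # attempt but still scores the point, because the majority answer wins.
    runs = [["42", "42", "17", "42"], ["9", "7", "9", "8"]]
    truths = ["42", "9"]
    print(cons_at_k(runs, truths))  # → 1.0
    ```

    This is exactly why cons@64 inflates scores relative to a single-attempt ("@1") number: a model that answers correctly only most of the time can still get full credit on a problem.
    
    
    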

    Grok 3 Reasoning Beta and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" (meaning the first score the models got on the benchmark) fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."

    Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:

    But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and about their strengths.




