Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash “across a broad range of widely reported benchmarks.”
Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta’s press release, the company highlighted Maverick’s ELO score of 1417, which placed it above OpenAI’s 4o and just below Gemini 2.5 Pro. (A higher ELO score means the model wins more often in the arena when going head-to-head with competitors.)
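As a rough illustration of what that number implies, here is a minimal sketch of the standard Elo expected-score formula; LMArena’s exact methodology isn’t detailed here, and the rival rating of 1400 below is a made-up figure used only for comparison:

    def elo_expected_score(rating_a: float, rating_b: float) -> float:
        # Standard Elo formula: estimated probability that model A beats model B head-to-head.
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    # Hypothetical matchup: Maverick's reported 1417 against a rival rated 1400.
    print(round(elo_expected_score(1417, 1400), 2))  # prints 0.52

In this hypothetical matchup, a 17-point gap translates into only a slight head-to-head edge, which is why small leaderboard differences get so much attention.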
The achievement seemed to position Meta’s open-weight Llama 4 as a serious challenger to the state-of-the-art closed models from OpenAI, Anthropic, and Google. Then AI researchers digging through Meta’s documentation discovered something unusual.
In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn’t the same as what’s available to the public. According to Meta’s own materials, it deployed an “experimental chat version” of Maverick to LMArena that was specifically “optimized for conversationality,” TechCrunch first reported.
“Meta’s interpretation of our policy did not match what we expect from model providers,” LMArena posted on X two days after the model’s release. “Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference. As a result of that, we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.”
A spokesperson for Meta did not have a response to LMArena’s statement in time for publication.
While what Meta did with Maverick isn’t explicitly against LMArena’s rules, the site has shared concerns about gaming the system and taken steps to “prevent overfitting and benchmark leakage.” When companies can submit specially tuned versions of their models for testing while releasing different versions to the public, benchmark rankings like LMArena become less meaningful as indicators of real-world performance.
“It’s the most widely respected general benchmark because all of the other ones suck,” independent AI researcher Simon Willison tells The Verge. “When Llama 4 came out, the fact that it came second in the arena, just after Gemini 2.5 Pro — that really impressed me, and I’m kicking myself for not reading the small print.”
Shortly after Meta released Maverick and Scout, the AI community started talking about a rumor that Meta had also trained its Llama 4 models to perform better on benchmarks while hiding their real limitations. Ahmad Al-Dahle, VP of generative AI at Meta, addressed the accusations in a post on X: “We’ve also heard claims that we trained on test sets — that’s simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.”
Some also noticed that Llama 4 was released at an odd time. Saturday doesn’t tend to be when big AI news drops. After someone on Threads asked why Llama 4 was released over the weekend, Meta CEO Mark Zuckerberg replied: “That’s when it was ready.”
“It’s a very confusing release in general,” says Willison, who closely follows and documents AI models. “The model score that we got there is completely worthless to me. I can’t even use the model that they got a high score on.”
Meta’s path to releasing Llama 4 wasn’t exactly smooth. According to a recent report from The Information, the company repeatedly pushed back the launch because the model failed to meet internal expectations. Those expectations are especially high after DeepSeek, an open-source AI startup from China, released an open-weight model that generated a ton of buzz.
Ultimately, using an optimized model in LMArena puts developers in a tough spot. When selecting models like Llama 4 for their applications, they naturally look to benchmarks for guidance. But as is the case with Maverick, those benchmarks can reflect capabilities that aren’t actually available in the models the public can access.
As AI development accelerates, this episode shows how benchmarks are becoming battlegrounds. It also shows how eager Meta is to be seen as an AI leader, even when that means gaming the system.