Anthropic used Pokémon to benchmark its latest AI model

Anthropic used Pokémon to benchmark its latest AI model. Yes, really.

In a blog post published Monday, Anthropic said that it tested its latest model, Claude 3.7 Sonnet, on the Game Boy classic Pokémon Red. The company equipped the model with basic memory, screen pixel input, and function calls to press buttons and navigate around the screen, allowing it to play Pokémon continuously.
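Anthropic has not published the harness itself, so the following is only a minimal sketch of what such a loop might look like: read the screen, consult a short memory, and call a function to press a button. Every name here (FakeEmulator, Agent, shade_policy) is hypothetical, and the policy function merely stands in for a real model call.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List

# Buttons a Game Boy exposes; the agent may only press these.
BUTTONS = ("a", "b", "start", "select", "up", "down", "left", "right")

Frame = List[List[int]]  # rows of grayscale pixel values


@dataclass
class FakeEmulator:
    """Hypothetical stand-in for a real emulator, so the sketch runs anywhere."""
    width: int = 4
    height: int = 4
    pressed: List[str] = field(default_factory=list)

    def screenshot(self) -> Frame:
        # The fake screen's shade changes with every button press.
        shade = len(self.pressed) % 4
        return [[shade] * self.width for _ in range(self.height)]

    def press(self, button: str) -> None:
        if button not in BUTTONS:
            raise ValueError(f"unknown button: {button}")
        self.pressed.append(button)


@dataclass
class Agent:
    """Ties together pixel input, a short memory, and button-press calls."""
    choose_button: Callable[[Frame, List[str]], str]  # placeholder for the model
    memory: Deque[str] = field(default_factory=lambda: deque(maxlen=8))

    def step(self, emulator: FakeEmulator) -> str:
        frame = emulator.screenshot()                       # screen pixel input
        button = self.choose_button(frame, list(self.memory))
        emulator.press(button)                              # function call
        self.memory.append(f"pressed {button}")             # basic memory
        return button


def run(agent: Agent, emulator: FakeEmulator, max_actions: int) -> List[str]:
    """Play for a fixed number of actions and return the buttons pressed."""
    return [agent.step(emulator) for _ in range(max_actions)]


def shade_policy(frame: Frame, memory: List[str]) -> str:
    # Toy rule in place of a model: press "a" on odd shades, "up" on even ones.
    return "a" if frame[0][0] % 2 else "up"


if __name__ == "__main__":
    print(run(Agent(choose_button=shade_policy), FakeEmulator(), max_actions=5))
```

In a real setup, shade_policy would be replaced by a request to the model, and FakeEmulator by an actual Game Boy emulator binding. The bounded deque is one simple way to keep "basic memory" from growing without limit.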

A unique feature of Claude 3.7 Sonnet is its ability to engage in “extended thinking.” Like OpenAI’s o3-mini and DeepSeek’s R1, Claude 3.7 Sonnet can “reason” through challenging problems by applying more computing and taking more time.

That came in handy in Pokémon Red, apparently.

Compared to a previous version of Claude, Claude 3.0 Sonnet, which failed to leave the house in Pallet Town where the story begins, Claude 3.7 Sonnet successfully battled three Pokémon gym leaders and won their badges.

Image Credits: Anthropic

Now, it’s not clear how much computing was required for Claude 3.7 Sonnet to reach these milestones, or how long each took. Anthropic only said that the model performed 35,000 actions to reach the last gym leader, Surge.

It surely won’t be long before some enterprising developer finds out.

Pokémon Red is more of a toy benchmark than anything. However, there is a long history of games being used for AI benchmarking purposes. In the past few months alone, a number of new apps and platforms have cropped up to test models’ game-playing abilities on titles ranging from Street Fighter to Pictionary.
