Chinese AI lab DeepSeek is perhaps getting the bulk of the tech industry’s attention this week. But one of its top domestic rivals, Alibaba, isn’t sitting idly by.
Alibaba’s Qwen team on Monday released a new family of AI models, Qwen2.5-VL, that can perform a number of text and image analysis tasks. The models can parse files, understand videos, and count objects in images, as well as control a PC, much like the model powering OpenAI’s recently launched Operator.
Per the Qwen team’s benchmarking, the best Qwen2.5-VL model beats OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 2.0 Flash on a range of video understanding, math, document analysis, and question-answering evaluations.
Qwen2.5-VL, which is available to test in Alibaba’s Qwen Chat app and to download from AI dev platform Hugging Face, can analyze charts and graphics, extract data from scans of invoices and forms, and “comprehend” multiple-hours-long videos, the Qwen team says. Qwen2.5-VL can also recognize “IPs from film and TV series, as well as a wide variety of products,” per the team, which suggests that the models might have been trained in part on copyrighted works.
Qwen2.5-VL, being AI developed by a Chinese company, has certain restrictions on the topics it will discuss, at least in Qwen Chat. When I asked the largest and most capable Qwen2.5-VL model, Qwen2.5-VL-72B, to talk about “Xi Jinping’s mistakes,” Qwen Chat threw an error message.
China’s internet regulator benchmarks many models developed in the country to ensure their responses “embody core socialist values.” Many Chinese AI systems decline to respond to topics that might raise the ire of regulators, such as Taiwan’s autonomy.
One of Qwen2.5-VL’s more interesting features is its ability to interact with software, both on PCs and mobile devices. A video posted on X by Philipp Schmid, a technical lead at Hugging Face, showed Qwen2.5-VL launching the Booking.com app for Android and booking a flight from Chongqing to Beijing.
Don’t Miss @Alibaba_Qwen 2.5 VL! Despite all the Deepseek Hype, Qwen just dropped the best open Multimodal! Qwen 2.5 VL is a Vision Language Model that can control your computer, similar to the @OpenAI operator, extract structured information from charts, and more!!
TL;DR;
3️⃣… pic.twitter.com/GeEGVdl0tI— Philipp Schmid (@_philschmid) January 27, 2025
In the video below, a Qwen2.5-VL model controls apps on a Linux desktop but doesn’t appear to accomplish much beyond switching tabs. Perhaps tellingly, Qwen’s benchmarking shows Qwen2.5-VL scoring poorly on OSWorld, a benchmark that tries to mimic a real computer environment.
LMAO Qwen 2.5 VL can do Computer Use, out of the box, taking on OpenAI Operator HEAD ON! 🐐 pic.twitter.com/lwMECXzNSu
— Vaibhav (VB) Srivastav (@reach_vb) January 27, 2025
The two smaller, less sophisticated models in the Qwen2.5-VL series, Qwen2.5-VL-3B and Qwen2.5-VL-7B, are available under a permissive license. The flagship Qwen2.5-VL-72B, however, is under Alibaba’s custom license, which requires that companies and devs with more than 100 million monthly active users request permission from Qwen/Alibaba before deploying the model commercially.