ElevenLabs, an AI startup that just raised a $180 million mega funding round, has been known primarily for its audio generation prowess. The company took a step in another technological direction by launching its first standalone speech-to-text model, called Scribe.
The startup, valued at $3.3 billion, has helped many other companies provide text-to-speech services through its vast library of voices. However, the company is now looking to get into speech detection and compete with the likes of Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI’s Whisper models.
ElevenLabs’ Scribe model supports over 99 languages at launch. The company places over 25 languages in the model's excellent accuracy category, where the word error rate is less than 5%. This list includes English (claimed accuracy rate of 97%), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese. Other languages are ranked in different categories: high (5 to 10% word error rate), good (10 to 20% word error rate), and moderate (25 to 50% word error rate).
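For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below illustrates that standard definition and maps a percentage onto the categories listed above. It is not ElevenLabs' evaluation code: the function names are hypothetical, the handling of the shared boundary at 10% is an assumption, and the 20 to 25% range, which the published categories do not cover, is reported as unclassified.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


def accuracy_category(wer_percent: float) -> str:
    """Map a WER percentage onto the categories described in the article."""
    if wer_percent < 5:
        return "excellent"
    if wer_percent <= 10:   # assumption: exactly 10% counts as "high"
        return "high"
    if wer_percent <= 20:
        return "good"
    if 25 <= wer_percent <= 50:
        return "moderate"
    return "unclassified"   # 20 to 25% and above 50% are not covered


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer * 100:.1f}% -> {accuracy_category(wer * 100)}")
```

In this example one of six reference words is substituted, so the transcript falls in the "good" band. If the claimed 97% English accuracy is read as one minus the word error rate, it would correspond to roughly 3%, inside the excellent category.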
The company said the model outperformed Google Gemini 2.0 Flash and Whisper Large V3 across multiple languages in FLEURS and Common Voice benchmark tests.
ElevenLabs had already developed a speech-to-text component for its AI conversational agent platform, which launched last year. However, this is the first time the company is releasing a standalone speech detection model. In a conversation with TechCrunch last month, CEO Mati Staniszewski talked about improving speech detection models.
“We want to understand what’s being said by you in a conversation better. We are working on ways to move away from only generating content and understanding and transcribing speech,” Staniszewski said at the time. “Many people say that speech-to-text is a solved problem. But for many languages, it’s pretty bad. We think we can build better speech detection models because we have in-house teams to annotate data and give us quick feedback.”
The model also features speaker diarization to tell you who is speaking, word-level timestamps for accurate subtitles, and auto-tagging of sound events such as audience laughter. The startup is also providing a way for customers to directly transcribe video content and add subtitles or captions in its studio.
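To show why word-level timestamps matter for subtitles, here is a minimal sketch that groups timed words into SubRip (SRT) cues. The input shape, a list of (text, start, end) tuples in seconds, is a hypothetical stand-in and not Scribe's actual response format; the function names and the seven-words-per-cue limit are likewise assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical shape for one transcribed word: (text, start_seconds, end_seconds)
Word = Tuple[str, float, float]


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group consecutive words into numbered cues of at most max_words each."""
    cues = []
    for index, start in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[start:start + max_words]
        text = " ".join(word[0] for word in chunk)
        # Each cue spans from its first word's start to its last word's end
        timing = f"{srt_timestamp(chunk[0][1])} --> {srt_timestamp(chunk[-1][2])}"
        cues.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(cues)


sample = [("Thanks", 0.0, 0.4), ("for", 0.45, 0.6), ("joining", 0.65, 1.1)]
print(words_to_srt(sample, max_words=2))
```

Because each cue's start and end come from the words themselves, captions can follow speech closely rather than relying on sentence-level estimates. A production tool would probably also break cues on pauses, punctuation, or speaker changes.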
Scribe currently works only with pre-recorded audio, which means it isn’t yet suited to live meeting transcription or voice note-taking. The company said it will launch a low-latency real-time version of the model soon.
ElevenLabs is pricing Scribe at $0.40 per hour of transcribed audio. While the rate is competitive, some of its rivals currently offer a lower price for audio transcription, with some differences in features.
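As a rough illustration of what that rate means in practice, the sketch below estimates the cost of a transcription job. It assumes billing scales linearly with audio duration, with no minimum charge, rounding increment, or volume discount; the article does not specify any of these, and the function name is hypothetical.

```python
from decimal import Decimal

# Launch price quoted in the article: $0.40 per hour of transcribed audio
PRICE_PER_HOUR_USD = Decimal("0.40")


def transcription_cost(audio_seconds: int) -> Decimal:
    """Estimate cost in USD, assuming simple linear per-second billing."""
    hours = Decimal(audio_seconds) / Decimal(3600)
    return (hours * PRICE_PER_HOUR_USD).quantize(Decimal("0.01"))


# A 90-minute recording and a 10-hour archive
print(transcription_cost(90 * 60))
print(transcription_cost(10 * 3600))
```

Under these assumptions, a 90-minute recording costs $0.60 and ten hours of audio cost $4.00.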