This powerful implementation opens up "discoverability of multimedia within a Web search," says Wilde, and he's right. A searcher no longer finds results based only on a video's metadata, but on the actual spoken content of the video as well.
Everyzing's two-part speech recognition technology first turns the audio of videos into text; it then analyzes, extracts, and indexes key terms, entities, and concepts within that text. This enables new multimedia indexes containing search-related terms that may appear only briefly, or merely in passing, in a video's spoken audio.
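To make the two-stage pipeline concrete, here's a minimal sketch in Python. Everything in it is hypothetical: Everyzing's actual system is proprietary, so stage one (speech-to-text) is simulated with canned transcripts and the video IDs are invented. The point it illustrates is the second stage, where an inverted index makes a video searchable by a term spoken even once.

```python
from collections import defaultdict

def transcribe(video_id):
    """Stand-in for a real ASR engine: returns a transcript string.
    The clip IDs and transcripts below are invented for illustration."""
    transcripts = {
        "clip-1": "the senator discussed climate policy and climate funding",
        "clip-2": "a brief mention of climate appears near the end",
    }
    return transcripts[video_id]

def build_index(video_ids):
    """Stage two: extract terms from each transcript and build an
    inverted index mapping term -> set of videos that speak it."""
    index = defaultdict(set)
    for vid in video_ids:
        for term in transcribe(vid).split():
            index[term].add(vid)
    return index

index = build_index(["clip-1", "clip-2"])
# Even the clip that mentions "climate" only once in passing is found.
print(sorted(index["climate"]))  # → ['clip-1', 'clip-2']
```

A production system would layer entity extraction and relevance scoring on top of this, but the basic promise is the same: a single spoken mention is enough to surface the video in search results.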
The core of the recognition, analysis, and indexing process comes from BBN Technologies. Everyzing combines BBN's Byblos speech-recognition engine with two of BBN's information-extraction technologies, one for speech and one for text. BBN is a leader in speaker-independent recognition accuracy across different environments, including telephony and broadcast news.
CEO Tom Wilde has been frank about the technology's present shortcomings; he notes that accuracy understandably drops when there is background music or multiple speakers. But for the infotainment and news markets he's targeting right now, he says the technology should offer a significant improvement over what's currently available. "I think we'll look back in a couple of years and say, 'Of course the content of multimedia files needs to be searchable,'" says Wilde. We agree!
A comprehensive look at the core technology from Technology Review can be found here as well.