The chief enemies of accurate speech recognition from mobile devices are the detrimental effects of low bit-rate coding, channel errors, frequency response clipping for bandwidth conservation, and the "white noise" that cellular networks insert into speaking pauses and silences when speech is sent over communications channels. All of these are common; in fact, they are now prevalent.
DSR proposes to overcome these limitations on the digitizing and transmission of the speech signal by sending the output of a low-overhead algorithm (from which a speech waveform can be reconstructed) across the data channels in newer communications schemes such as GPRS, EDGE, 3G, and 3GPPm to a back-end recognizer on a remote server.
A compressed, error-protected extraction of the speech features (sometimes referred to as "spectral features"), including inflection, is sent digitally, instead of subjecting the user's speech signal to the various pollutants common in standard wireless communication channels.
This algorithm is performed by a "front end" built into lightweight mobile devices with limited memory and processing resources, which would be overwhelmed by trying to run installed speech recognition software or, alternatively, by performing complicated, heavy ADC and speech compression.
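To give a rough feel for what a "front end" does, here is a minimal sketch in Python. It is not the Aurora/ETSI algorithm: the frame length, hop size, number of bands, and the one-byte-per-feature quantizer are all our own assumptions for illustration, and error protection is left out entirely. It simply slices the signal into short frames, measures the log energy in a few frequency bands per frame, and packs each frame into a handful of bytes.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz (assumed narrowband input)
FRAME_LEN = 200      # samples per frame, 25 ms (assumed)
FRAME_SHIFT = 80     # samples between frame starts, 10 ms (assumed)
NUM_BANDS = 10       # coarse, equal-width bands (assumed; not a mel filterbank)


def frame_features(frame):
    """Return NUM_BANDS log band energies for one frame of samples."""
    n = len(frame)
    # Hamming window to reduce spectral leakage at the frame edges.
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    half = n // 2
    power = []
    for k in range(half):
        # Direct DFT of bin k (slow but dependency-free).
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        power.append(abs(acc) ** 2)
    band_size = half // NUM_BANDS
    return [math.log(sum(power[b * band_size:(b + 1) * band_size]) + 1e-10)
            for b in range(NUM_BANDS)]


def extract(signal):
    """Split the signal into overlapping frames and extract features from each."""
    return [frame_features(signal[start:start + FRAME_LEN])
            for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_SHIFT)]


def quantize(features, lo=-25.0, hi=25.0):
    """Clip each feature to [lo, hi] and map it to a single byte (0..255)."""
    out = bytearray()
    for value in features:
        clipped = max(lo, min(hi, value))
        out.append(round((clipped - lo) / (hi - lo) * 255))
    return bytes(out)


def feature_bitrate():
    """Bits per second needed for the quantized features in this sketch."""
    frames_per_second = SAMPLE_RATE // FRAME_SHIFT
    return frames_per_second * NUM_BANDS * 8


def pcm_bitrate(bits_per_sample=16):
    """Bits per second for the raw, uncompressed samples."""
    return SAMPLE_RATE * bits_per_sample


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(800)]
    packets = [quantize(f) for f in extract(tone)]
    print(len(packets), "frames,", len(packets[0]), "bytes each")
    print("features:", feature_bitrate(), "bps; raw PCM:", pcm_bitrate(), "bps")
```

Even this crude version shows the appeal: the device ships a few bytes per frame over the data channel rather than the waveform itself, and the heavy lifting of recognition stays on the server.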
A standardized DSR was proposed in Europe in late 1999 to early 2000 by an ETSI project named "Aurora". Although it had momentum and a very effective noise-canceling DSR algorithm by Motorola's David Pearce, who was also Aurora's chairman, Aurora died on the vine sometime around mid-2004 for a variety of reasons, among them (scuttlebutt) the unwillingness of mobile device manufacturers to support a standard protocol.
(Aurora's dilemma is discussed further below.)
In February of 2001, DSR pages appeared on the website of IBM's Israel-based Haifa Research Lab, noting that with DSR, "speech accuracy is kept virtually unmodified". Haifa's research team demonstrated this to the 3GPP group, which then selected DSR as the recommended codec for Speech Enabled Services. The IBM Haifa Research site published a demonstration of DSR at work, and its pages about DSR were updated until late 2004.
In May of 2001, L&H Automotive Solutions Group announced it was introducing DSR technology to the automotive market, noting, "...DSR-based interfaces give greater access to the wireless information and services that today's drivers are increasingly demanding." Unfortunately, little evolved from this (that we know of), and shortly thereafter Microsoft absorbed L&H; it's probable that any DSR projects or research fell by the wayside as a result.
In October of 2002, Conversay announced that Qualcomm would be using Conversay's flavor of DSR in its MSM5100 chipset, but later data sheets and developer news releases never mentioned DSR again.
In June 2004 3GPP approved the DSR Extended Advanced Front-end as the recommended codec for “Speech Enabled Services”.
Very little seems to have happened with DSR for about a year. Then, in August 2005, the central IBM Research website began posting pages about IBM's efforts toward DSR. IBM also noted a new and attractive feature of DSR: single-channel multimodal synchronization, using only the data channel of emerging wireless schemes.
"... Using DSR, multi-modal applications can send speech data and application data together on one channel"
" ... The DSR solution typically works in a VoIP call environment."
" ... In these cases, all the information flows through a single data channel, as opposed to a voice channel that is used, for example, for regular phone conversations. Most new cellular communication technologies, such as GPRS or 3G, support data channels. "
When we were looking again at DSR and preparing this post, our Project Mgr. was advised by David Pearce of an encouraging article that appeared on the VoiceXMLReview.org website in December of 2005. David's page offers some very nice comparison data on DSR and some in-depth looks at its implementation; moreover, DSR is available.
We'll post another blog entry soon, after we follow up with all the places David pointed us to!
The advantages of DSR are tremendous. Take a cell phone as an example: when a user makes a call, the analog signal of his voice from the microphone is first converted to either GSM (Cingular, T-Mobile), CDMA (Verizon) or PCS (Sprint).
This initial conversion narrows the frequency response, typically to a band from about 200 Hz to about 3.8 kHz. Once this signal arrives at the nearest tower, further changes might be made to the signal for its journey through the cellular network. For instance, in many areas where Verizon doesn't have tower coverage, the Verizon network operates across the local Sprint PCS backbone, requiring further small changes.
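As a back-of-the-envelope illustration of what that narrowing means, the sketch below treats the channel as an ideal "brick wall" filter over the 200 Hz to 3.8 kHz band quoted above and asks how much of a signal's energy survives. The filter model is a simplification on our part, and the component frequencies and energies in the example are made-up numbers, not measurements of real speech.

```python
PASSBAND_HZ = (200.0, 3800.0)  # the band quoted in the text above


def retained_fraction(components, passband=PASSBAND_HZ):
    """Fraction of total energy that falls inside the passband.

    components: list of (frequency_hz, energy) pairs describing a signal
    as a handful of spectral components (an illustrative simplification).
    """
    lo, hi = passband
    total = sum(energy for _, energy in components)
    kept = sum(energy for freq, energy in components if lo <= freq <= hi)
    return kept / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical spectrum: some low rumble, mid-band energy, and a
    # high-frequency component of the kind sibilants such as "s" carry.
    example = [(120, 1.0), (500, 4.0), (1500, 3.0), (3000, 1.0), (5500, 1.0)]
    print(f"{retained_fraction(example):.0%} of the energy survives")
```

Whatever falls outside the band is simply gone by the time a recognizer hears the signal, and no amount of server horsepower brings it back.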
Next stop, for our discussion here: the PSTN.
At this juncture, the signal is converted into either G.711 or 32 kbps ADPCM for its travel through landline networks. It arrives at the remote server, where speech recognition is performed on a signal that's been truncated and converted at least twice. And did we mention the white noise that cellular networks add in?
The server attempting to perform recognition is sorely taxed trying to listen to what's left of the speech after the signal crunching and channel errors along the way, in addition to the poor reproduction that the low bit-rate coding of the originating wireless network produces in the first place.
As David notes in his VoiceXML page, "By performing the front-end processing in the device directly on the speech waveform rather than after transitions with a voice codec, the degradations introduced by the codec are avoided".
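To make one of those codec steps a little more concrete: G.711 in its mu-law form squeezes each sample into 8 bits using a logarithmic curve. The sketch below uses the textbook continuous mu-law formula with mu = 255 rather than the bit-exact segment tables of the real standard, so treat it as an approximation for illustration only. The function names are ours.

```python
import math

MU = 255.0  # mu-law parameter used by 8-bit mu-law companding


def mulaw_encode(x):
    """Compress a sample in [-1.0, 1.0] to an 8-bit code (0..255)."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255))


def mulaw_decode(code):
    """Expand an 8-bit code (0..255) back to a sample in [-1.0, 1.0]."""
    y = code / 255.0 * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def max_roundtrip_error(samples):
    """Largest absolute difference between a sample and its decoded copy."""
    return max(abs(s - mulaw_decode(mulaw_encode(s))) for s in samples)


if __name__ == "__main__":
    loud = [i / 1000.0 for i in range(-1000, 1001)]
    quiet = [i / 100000.0 for i in range(-1000, 1001)]
    print("worst error, full-scale samples:", max_roundtrip_error(loud))
    print("worst error, quiet samples:     ", max_roundtrip_error(quiet))
```

The error introduced here is small on its own, but it is one more layer stacked on top of the band-limiting and low bit-rate coding that already happened upstream, which is exactly the pile-up DSR sidesteps by extracting features before any voice codec touches the signal.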
DSR's process is to sample the analog signal right off the microphone, and then use a data channel (no intermediate codecs) to transmit the Advanced Front End features to the back-end recognizer on the remote server.
From David's page: "For the comparison below, a professional transcription house was used to transcribe sentences from the Wall Street Journal that had been passed through the DSR reconstruction and other reference codecs."
Take a look at the impressive result: the error rate in the transcription is less than 1%, and only slightly higher (0.2%) than for the unprocessed voice signal.
[Number of missed/wrongly transcribed/partially transcribed words]
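For readers who want to see how an error figure like this is typically computed, here is a small sketch of the standard word error rate: the minimum number of word substitutions, insertions, and deletions needed to turn the transcript into the reference, divided by the number of reference words. This is the generic metric, not necessarily the exact scoring used for the comparison on David's page, and the example sentences are our own.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance_in_words, number_of_reference_words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1], len(ref)


def word_error_rate(reference, hypothesis):
    """Word error rate as a fraction of the reference length."""
    errors, total = word_errors(reference, hypothesis)
    return errors / total if total else 0.0


if __name__ == "__main__":
    ref = "the market closed higher on friday"
    hyp = "the market closed hire on friday"
    print(f"WER: {word_error_rate(ref, hyp):.1%}")
```

An error rate under 1% means fewer than one word in a hundred comes out missed or wrong.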
3GPP (3rd Generation Partnership Project), which sets standards for GSM and UMTS mobile communications, tested DSR for a new work item called Speech Enabled Services (SES). Two candidates, AMR/AMR-WB (the existing voice codecs for 3GPP) and DSR, were evaluated.
Two ASR vendors undertook testing: IBM and the former SpeechWorks, now ScanSoft. The performance evaluations were conducted over a wide range of different databases, some brought from 3GPP, but also proprietary databases owned by the ASR vendors. Both candidates used the packet data channel rather than the circuit-switched channel, as well. It's important to note, when viewing the comparisons below, that an 11.5 kHz signal is entirely possible today, contingent upon available upstream data rates, and with wireless carrier broadband, 16 kHz is no longer out of reach, either. In any event, the channel error improvement alone is terrific, but DSR is a significant overall improvement at any data rate.
[Results from ASR vendor evaluations in 3GPP]
With DSR, and in the happy eventuality that DSR becomes a standard...
Manufacturers could release the horsepower they'd otherwise have to apply toward their own proprietary attempts, either at local speech recognizers or at running proprietary compression schemes to packetize voice, and devote those resources to upgrading local audio components, giving the best voice signal available to form the reconstructed waveform, for instance. (Wouldn't that be cool: handhelds with really superior microphone elements?)
Today, DSR offers an answer to so many of the hurdles facing better wireless speech recognition.
As proponents of better wireless speech recognition, we strongly advocate DSR as an answer to many, many problems the technology faces when trying to integrate into the mobile and WiFi device landscapes.
ASR systems could function far more robustly, faster and with uncanny accuracy. "Local" recognizers on devices are maturing somewhat (voice dialing and local dictation of text messages, for example), and the computational power of these devices is increasing, but the complexity of bigger-vocabulary speech recognition systems is beyond the resources of even today's best devices. DSR removes any perceived need for such on-device systems as an alternative to recognition performed on voice signals carried across communication channels. If DSR became prevalent, speech recognition would extend quickly and broadly into more and more online applications and telematics; the potential is truly amazing. And that's just a beginning. Extremely accurate dictation, remote command and control applications and highly advanced, synchronized multimodal features would be within easy reach.
What's wrong? Well, pure capitalism (we're fans), for one. And that leaves DSR at the end of a two-edged sword.
While it's certain that healthy competition among developers would eventually bring the best DSR available, once again the word "standard" rears its ugly head, and we can hear the cries from here: "That takes away our competitive edge!"
Moreover, as David cites on his VoiceXML page, there's an understandable "you go first" dilemma. Here's David's description of the hurdle to making DSR a default interface:
"It’s been a bit of a “chicken and egg” conundrum with server vendors waiting to see widespread availability of DSR in handsets before making product commitments and handset manufactures similarly asking “where are the recognition servers to support applications?”. "
But, as David notes, this is no longer the case:
"The data connectivity provided by 2.5G GPRS networks to support the transport of packet switched speech and multimodal services are already widely deployed and wider bandwidths of 3G data networks are being launched. With the adoption of DSR by 3GPP we have an egg! "
It would seem we do!
To the consumer, DSR's benefits would be plentiful. No matter what mobile device they used, what remote server that device was talking to, or whose wireless network carried the traffic, consumers could enjoy effective, accurate speech recognition that even resists ambient noise interference. Ever tried to talk to an IVR system near traffic? In a crowded mall setting? DSR's Advanced Front End (AFE) provides state-of-the-art robustness to background noise. In addition to almost unmatched reproduction accuracy, AFE offers a 50% decrease in error rate compared to the mel-cepstrum front end, and about twice the noise canceling of comparable, usable algorithms.
DSR's voice device benefits don't end there, either. Because of its low cost, fairly "dumb" wearable wireless speech devices (lapel microphones, handheld/mobile headsets) could themselves begin carrying DSR as a way to communicate across the network with a remote recognizer, removing the need for a handheld in many instances.
Just think: if all this were somehow standardized, wouldn't it be nice to see devices get a little less expensive for a while?