Wireless Speech Recognition ..

Speech recognition is now primarily wireless; we've migrated fast to universal wireless access-communication devices.

Often, the speech recognition is remote-based, and the better the signal we send it, the better it performs.

Here, we hope you'll find ideas, technology or projects using hands-free and/or mobile devices to make wireless speech recognition a rewarding and useful universal tool!

Friday, December 28, 2007

I posted this... with Jott

I am posting p this using the Jott application. I am doing this from my cellphone. If it happens to work incredibly accurate, it is because quite often, human transcribers are the ones typing what you are reading. (Click here to listen to the original audio) *Powered by Jott

We posted the above by cell phone, using the Jott application.

(We've blogged before about the cool stuff Jott does ;-). For a cell-phone-to-web application, not bad, eh?

Monday, December 24, 2007

Your "Talking Character" says Happy Holidays!

Want to send a neat-o talking holiday card, one that's truly a little different?

Check out PQ Computing's "Talking Photo" application - With a few clicks, here's what one can do, according to the Talking Character web page:

Use one picture to make realistic 3D faces for animation
Animate any human / animal photo, painting or drawing
Automatically match lip movement with voice
Most major languages are supported!

'Ya gotta check this out..

A nice cool way to spice up the holidays, eh?

Labels: Happy New Year, Merry Christmas, PQ Computing, Talking Character, Talking Photo

Thursday, December 20, 2007

Distributed Speech Recognition is emerging..

We posted an article about Distributed Speech Recognition ("DSR") a little over a year ago.

The great news is that, among other big players in the speech recognition marketplace, SRI International, maker of the advanced speaker-independent DynaSpeak SDK, has incorporated Distributed Speech Recognition as an integral part of the SDK.

From the DSR section of the DynaSpeak pages on SRI's website:
"To date, speech recognition systems have been deployed in two ways: on a remote server or pre-loaded on a mobile device. Either approach forced makers of mobile phones, PDAs, PCs, and consumer and automotive electronics products to accept tradeoffs. To eliminate design sacrifices, SRI has created a third mode of deploying speech recognition: DynaSpeak with Distributed Speech Recognition (DSR). With DSR, a user's speech is preprocessed on the user device and transmitted over a low bandwidth channel to a full-featured server-side system. The benefits are numerous: higher quality audio capture, lower cost per device, and centralized management of speech applications."

SRI isn't the only big player to incorporate DSR; Hewlett-Packard has been developing distributed speech recognition as a power-saving solution for wireless devices:
"We have shown that DSR can reduce the required systemwide energy consumption for a speech recognition task by over 95% compared to a software based client-side speech recognition system. These savings include the software optimizations of the DSR front-end as well as the savings from the decreased duty cycle of the wireless interface."

Nuance has also built DSR into its OpenSpeech Recognizer 2.0, which is available with its Network Speech Solutions.
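For readers wondering what "preprocessed on the user device" might look like in practice, here is a minimal sketch of the idea. It is our own illustration, not SRI's, HP's or Nuance's code: the function names, the 8 kHz sample rate, the frame sizes and the two toy features (log energy and zero-crossing rate) are all assumptions. Real DSR front-ends send richer spectral features, but the bandwidth point is the same: the handset sends a few bytes per frame instead of the raw audio.

```python
import math

def frame_features(samples, frame_len=200, hop=80):
    """Return (log_energy, zero_crossing_rate) for each overlapping frame.

    Assumes 16-bit PCM samples; at 8 kHz, 200 samples is 25 ms and 80 is 10 ms.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        log_energy = math.log10(energy + 1.0)
        # Count sign changes between neighbouring samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append((log_energy, crossings / (frame_len - 1)))
    return features

def pack_features(features, max_log_energy=10.0):
    """Quantize each feature to one byte; this is what the device would transmit."""
    payload = bytearray()
    for log_energy, zcr in features:
        payload.append(min(255, int(log_energy / max_log_energy * 255)))
        payload.append(min(255, int(zcr * 255)))
    return bytes(payload)

# Hypothetical input: one second of a 440 Hz tone sampled at 8 kHz.
tone = [int(10000 * math.sin(2 * math.pi * 440 * n / 8000)) for n in range(8000)]
payload = pack_features(frame_features(tone))
print(len(payload), "feature bytes versus", 2 * len(tone), "bytes of raw 16-bit audio")
```

In this toy setup the device sends a couple of hundred bytes per second of speech rather than 16,000, and the server-side recognizer does the heavy lifting.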

We cheer these efforts, and hope handset manufacturers will begin to support DSR inside 3G mobile phones. The "chicken or the egg" dilemma that David Pearce, DSR's chief developer, spoke of both at SpeechTEK 2005 and in his VoiceXML articles is solved. Now that it's "hatched," let's hope it continues to grow!

Monday, December 17, 2007

Mobile speech recognition advances!

AT&T Wireless announced today it has partnered with Barcelona, Spain based Code Factory to bring very advanced speech recognition services and highly advanced screen reader capability to AT&T Wireless phones.

Code Factory's "Mobile Speak" is available for mobile phones using Symbian Operating System versions 6, 7, 8.x and 9.x running the Series 60 Edition interface, as well as for nearly all Windows Mobile Smartphones and Pocket PCs.

Mobile Speak for Symbian-based mobile phones includes impressive controls for the Code Factory mobile screen readers such as:

"Commands that allow changing of important settings on the fly, such as keyboard echo, punctuation level, verbosity level, speech rate and volume."
"Commands for repeating or spelling the last spoken text, reading the whole screen or just parts of it like the softkeys, as well as interrupting speech output or toggling speech mute."
"Command Help Mode available anywhere on the phone."

But these are only the true speech recognition commands; the overall speech interface for Mobile Speak itself offers a truly amazing set of features. Mobile Speak for Windows Mobile Smartphones and Pocket PCs includes Microsoft Voice Command for wireless control of the device, whether or not a native recognizer is already present in the device.

Additionally, the Mobile Tools product suite (which includes Mobile Speak) includes Remote Access applications that allow the users to remotely perform tasks on their mobile phones, even if a screen reader or screen magnifier is not yet installed.

Code Factory boasts a rather admirable mission philosophy..

"Some Mobile Tools are developed by Code Factory to meet specific needs of Mobile Speak and Mobile Magnifier users. Other Mobile Tools are applications from 3rd-party developers and then supported by Code Factory to work with our screen readers and screen magnifiers. Our goal is not limited to just making mainstream mobile technology accessible, but also to harness its full potential and offer users a multipurpose solution in one portable device."

Friday, December 14, 2007

Speech recognition & AI, formatting your to-do lists?

Meet CALO, an artificial intelligence project funded by the Defense Advanced Research Projects Agency (DARPA) and coordinated by SRI International. What's interesting to us is its research into using speech recognition to understand, organize and assemble task information from everyday meetings.

CALO promises, among other things, to use speech recognition to transcribe what's said during meetings and, using its understanding of a user's projects and contacts, to compile to-do lists and appointments!

Imagine: when you log on every morning, your computer reminds you it's assembled a list of tasks and meeting requests from yesterday's meeting, while you were asleep. How cool is this?

The military's research into AI coupled with speech recognition goes back at least a quarter of a century; in 1982, a study was published on different AI algorithms applied to speech recognizers.

The study of artificial intelligence is thought by many to have originated in 1947 with the mathematician Alan Turing, although it seems he initially believed "Machine Intelligence" to be an apparent contradiction in terms.

What CALO promises is a far cry from Turing's "Can a machine be intelligent?" musings, and we believe it's yet another step towards fulfilling Bill Gates's recent prediction that speech recognition will quickly become the next user paradigm!

Thursday, December 13, 2007

Speech recognition's accuracy better than human transcription..

Welcome news for speech recognition proponents!

HealthImaging.com (a site for healthcare IT professionals) posted a news article yesterday, December 13, about a presentation at the Radiological Society of North America (RSNA)'s annual meeting last month. The presentation documented that reports manually transcribed by humans showed higher error rates than reports transcribed through speech recognition!

John Floyd, MD, a partner in the 24-member Radiology Consultants of Iowa (RCI), reported: “The rate for significant errors, requiring the preparation of an addendum, was 0.6 percent for speech recognition and 2 percent for traditional transcription.”

Floyd also noted speech recognition significantly increased his firm's efficiency: "Separate data for this practice indicated that average turn-around time for traditional transcription was greater than 24 hours while that for speech recognition was less than one hour.."

Dr. Floyd further confirmed that the accuracy rate for speech recognition reported by his group was independently verified by 3rd party analysis, conducted at one of the hospitals his partnership services.

In an On10Net blog post, Bill Crounse, MD, Healthcare Industry Director for Microsoft Corporation, predicted earlier this year that speech recognition would open up new vistas in the healthcare industry.

We're pleased to see his predictions coming true!

Labels: accuracy, healthcare, remote speech recognition, transcription

Wednesday, December 12, 2007

MIT's Browsing through speech inside videos

MIT's new CSAIL (Computer Science and Artificial Intelligence Laboratory) "Lecture Browser" may be raising the bar on searching the spoken audio in videos, for indexing. In fact, it's receiving over 20,000 hits per day - and it is to date only indexing lectures.

Originally funded by Microsoft and first announced in August, the Lecture Browser offers results in either video or audio timeline sections. The section containing the search term is highlighted, and snippets of surrounding text are displayed. The searcher can also "jump" to the relevant section of the video directly from the index.

There are some impressive features built into this rather advanced application.

Optimized speech transcription:
- The speech recognition has been trained and configured to accurately transcribe accented speech, using short snippets of recorded speech spoken in various accents.

Accurate recognition of uncommon words:
- A massive vocabulary has been trained into the system's lexicon, allowing it to recognize extremely uncommon scientific terms and the like.

Topic segmentation:
- The system includes software designed by MIT that separates long strings of sentences with common topics into high-level concepts. "Topical transitions are very subtle," says Regina Barzilay, professor of Computer Science at MIT. "Lectures aren't like normal text."
- The software takes blocks of roughly 100 words and compares them, counting the words the blocks share. Key terms that repeat often are given more weight, and chunks with the highest rate of similar words are grouped together.
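The block-comparison idea described above can be sketched in a few lines. This is only our illustration of the general technique, not MIT's actual software: the function names, the cosine-similarity weighting and the 0.2 threshold are assumptions we picked for the example, and a real system would also filter out common stop words.

```python
from collections import Counter
import math
import re

def split_into_blocks(text, block_size=100):
    """Split text into consecutive blocks of roughly block_size words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]

def block_similarity(block_a, block_b):
    """Cosine similarity of word counts, so frequently repeated shared terms weigh more."""
    counts_a, counts_b = Counter(block_a), Counter(block_b)
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def group_blocks(blocks, threshold=0.2):
    """Group adjacent blocks whose similarity meets the (assumed) threshold."""
    if not blocks:
        return []
    groups = [[0]]
    for i in range(1, len(blocks)):
        if block_similarity(blocks[i - 1], blocks[i]) >= threshold:
            groups[-1].append(i)   # same topic continues
        else:
            groups.append([i])     # low overlap: start a new topic
    return groups
```

Feeding a lecture transcript to `split_into_blocks` and then `group_blocks` yields lists of block numbers, one list per detected topic; the threshold would need tuning on real transcripts.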

MIT's efforts to optimize the user experience are on-going. In the future, users will have the ability to contribute transcript corrections much like the "Wikipedia process", further improving transcription accuracy.

Even more impressive: MIT's plans include the ability for the system to learn from these corrections, as they propagate to other transcribed lectures.

A more comprehensive overview can also be read here.

Labels: accented speech, browsing, MIT, speech recognition, transcription, transcription learning, Videos

Saturday, December 08, 2007

Twitter by speech!

The short and sweet?

Use www.jott.com: sign up for a free Jott account, then connect the service to your Twitter account. After that, you can call a Jott phone number, tell the system where you want your comments to go, and leave a voice message that gets converted to text and posted to your Twitter.

From the OpenSourceMarketer Post:

"The service is extremely accurate and easy to use and since twitter can be connected up in lots of places you could effectively generate new content in hundreds of places just by making one phone call and essentially leaving a voice message.

The Jott service also offers the ability to connect to Tumblr, WordPress, Yahoo Groups, LiveJournal, Amazon, and lots of other sites which is great because it puts blogging and written social communications in your pocket."

How cool is that?
True, wireless and remote speech recognition. And no gadgets to buy.

And just the other day, one of our members posted an article in Techdirt about this being the killer mobile speech recognition application!

Labels: blog post, mobile speech recognition, Jott, Twitter

Speech-controlling Bluetooth devices..

Foxlink Group and Sensory, Inc. have teamed up to bring speech I/O (speech recognition and spoken prompts) to Bluetooth products.
How cool!

Foxlink's US Operations President, Mr. James Lee, notes: “Our customers want improved user experiences for their headset products. With Sensory’s voice recognition and speech output technologies we can add more features, while making the products easier to use.”

According to their announcement's web page, speech recognition makes Bluetooth device controls far more convenient and intuitive; and having a pleasant voice confirming status and commands, rather than beeps or light flashes, adds to the VUI experience.

Once again, speech recognition easily solves a common problem: "Do I hold down the right button, or the left button? Where did I store that manual, anyway .. ??"

Moreover, analysts are predicting speech control over devices, commonly referred to as VUI (Voice User Interface), will quickly become a common device interface.

"I think this is an intermediary step for what’s going to come.." says Datamonitor analyst Daniel Hong.

Foxlink is one of the world’s largest independent manufacturers of Bluetooth headsets, and a manufacturer and/or ODM for most of the major brands that use 3rd party silent OEMs.

Sensory is a market leader in embedded speech technologies and the 1st speech technology provider to port a VUI to a single-chip Bluetooth solution with Cambridge Silicon Radio, a premier Bluetooth chipset manufacturer.

Labels: Bluetooth, devices, interface, speech controls, Voice User Interface, VUI

Searching the audio in videos, with speech recognition..

In an interview, Everyzing's CEO Tom Wilde discusses a valuable step forward in letting Web users find the videos they are interested in: using speech recognition to parse and publish the spoken audio streams of videos posted to the Web.

This powerful, useful implementation opens up "discoverability of multimedia within a Web search," says Wilde, and he's right. No longer does a searcher find only results centered on a video's metadata; the actual audio content of that video is searchable as well.

Everyzing's 2-part speech recognition technology first turns the audio of videos into text; then it analyzes, extracts and indexes key terms, entities and concepts within the text. This enables new multimedia category indexes containing search-related terms that may appear only briefly, or just in passing, inside any video's spoken audio!
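To make the second step a little more concrete, here is a rough sketch of how timestamped transcript text could be turned into a searchable index. It is not Everyzing's or BBN's code; the segment format, the tiny stop-word list and the function names are hypothetical, and we simply assume the recognizer has already produced (video, start time, text) segments.

```python
from collections import defaultdict
import re

# A deliberately tiny, assumed stop-word list; real systems use far larger ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "on"}

def build_index(segments):
    """Map each term to the (video_id, start_seconds) spots where it is spoken."""
    index = defaultdict(list)
    for video_id, start_seconds, text in segments:
        for term in set(re.findall(r"[a-z0-9']+", text.lower())):
            if term not in STOP_WORDS:
                index[term].append((video_id, start_seconds))
    return index

def search(index, query):
    """Return sorted jump points for a single-word query."""
    return sorted(index.get(query.lower(), []))

# Hypothetical recognizer output: (video id, start time in seconds, transcribed text).
segments = [
    ("news-01", 0.0, "The mayor opened the new bridge"),
    ("news-01", 42.5, "Traffic on the bridge is light"),
    ("talk-07", 10.0, "An interview about speech recognition"),
]
index = build_index(segments)
print(search(index, "bridge"))
```

A search for a word that is only mentioned in passing still returns the exact video and second where it was spoken, which is the "discoverability" Wilde is talking about.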

The core of the recognition, analysis and indexing process comes from BBN Technology. Everyzing is combining BBN's Byblos engine with BBN's two extraction technologies, Information Extraction from Speech and Information Extraction from Text. BBN is a leader in speaker-independent recognition accuracy for speech in different environments, including telephony and broadcast news.

CEO Tom Wilde has been frank about the present shortcomings; he notes that accuracy drops (understandably) against background music and/or multiple speakers. But for the infotainment & news markets he's targeting right now, the technology should offer a significant improvement over what's currently available, he says. "I think we'll look back in a couple of years and say, 'Of course the content of multimedia files needs to be searchable,'" says Wilde. We agree!

A comprehensive look at the core technology by Technology Review can be found here, as well.

Labels: automated speech recognition, BBN Technology, searching, spoken audio content, Videos

Tuesday, December 04, 2007

Thousands of voices, incredible recognition!

In an article published online August 11, 2007, IBM Research announces the development of a new Cell Broadband Engine™ ("Cell/B.E.") processor. This new Cell/B.E. is a streaming multiprocessor whose architecture contains a general-purpose IBM PowerPC processor, working with additional special-purpose processing cores jointly designed by Sony, Toshiba, and IBM.

This processor represents a very significant leap forward, both for speech recognition in general and, primarily, for what we all love to hate: the infamous IVRs we hear answering most of our phone calls to most companies today.

IBM notes: "Speech recognition systems in telephony applications for automated call centers represent the largest segment of the speech processing market". How true, and how sad the IVR performances that we often encounter really are. Who hasn't begun to pound on the "0" key and/or shout "Operator! Customer Service!" in desperation when encountering one of these infamous sad performers?

Current multi-channel speech recognition systems that use "clusters" of traditional CPUs can manage between 20 and 30 speech channels in real time.

The Cell/B.E. can handle thousands of simultaneous voice channels in real time, and IBM states "On both the Cell/B.E. processor and the software platforms, recognition accuracy was 99%".

Take a look at the tremendous difference in performance, below:

(In the Table above, 1 RTC = 1 second of audio per 1 second of processing time)

The performance measurement above is based on speaker-independent recognition of a small vocabulary, based on the TIDIGITS corpus that included the digits "zero" through "nine" including the "oh" pronunciation for zero, using a proprietary IBM speech recognition engine (whose recognition algorithms are explained in detail). It's obviously fairly simple recognition testing, even though it included speakers of different genders and dialects.

Nonetheless, IBM notes the "performance of our prototype speech recognition engine on the Cell/B.E. processor can be extended to production systems because the SPE kernel programs were designed to scale with model and language complexity."

They further point out that, due to the raw computational power and memory management of the Cell/B.E., there will still be much higher recognition accuracy even when it is used with systems that have large vocabularies and complex grammars.

Additionally, they plan to add models built on the Texas Instruments/Massachusetts Institute of Technology ("TIMIT") corpus, which is designed specifically for automated speech recognition systems.

Also interesting are IBM's plans to optimize the Cell/B.E. for recognition of compressed speech signals, and they mention pursuing the proverbial Holy Grail, software-based noise canceling: "...we are trying to classify speech from background"!!

The research team goes on to say in their conclusion:
"We have implemented and demonstrated a prototype speech recognition engine that is capable of processing approximately 1,000 speech channels on a single Cell/B.E. processor. The kernel computations are designed to be highly scalable, and we expect this performance result to generalize well to commercial speech systems".

What a terrific development. IBM's embedded ViaVoice won Speech Technology Magazine's 2007 Market Leader award; and some of us remember when ViaVoice for continuous speech recognition (circa 1997, 1998) was the best of the best. We admit we are definitely looking forward to seeing this technology emerge into the mainstream, and push speech recognition to new limits!
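For a feel of what those channel counts mean to a call center, here is a small back-of-the-envelope sketch using the RTC unit defined in this post (1 RTC = 1 second of audio per 1 second of processing time). The function names are ours, the 1,000-call load is a hypothetical figure, and 25 channels is simply the midpoint of the 20 to 30 quoted above for conventional CPU clusters; none of this comes from IBM's paper.

```python
import math

def real_time_channels(audio_seconds, processing_seconds):
    """Throughput in RTC: seconds of audio handled per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds

def servers_needed(concurrent_calls, channels_per_server):
    """Whole servers required to keep every concurrent call in real time."""
    if channels_per_server <= 0:
        raise ValueError("channels_per_server must be positive")
    return math.ceil(concurrent_calls / channels_per_server)

# Hypothetical load: 1,000 callers talking to the IVR at the same moment.
calls = 1000
print("Conventional cluster nodes:", servers_needed(calls, 25))
print("Processors at ~1,000 channels each:", servers_needed(calls, 1000))
```

Under these assumptions, a load that would need dozens of conventional nodes fits on a single processor of the kind IBM describes, which is why the result matters so much for telephony.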

Labels: automated speech recognition, Cell.B.E., IBM, telephony

Wireless, remote speech recognition driven spybots!

Robert at RoboDance has created a video of some incredible wireless, remote speech recognition functions he's added to a WowWee Roboquad robot. It is quite easily used as a "Spybot" for remote surveillance, and can be controlled remotely across the Internet, using speech recognition, from anywhere! It's called RoboDance 4.

Using Skype's video call service, and an infrared link from the host PC to the robot, Robert shows how he makes this robot do some amazing things..

We could tell you more, but the video below has to be seen to be believed.
(** What's particularly impressive is that during the video, Rob's robot listens only to its commands, even though they're mixed with his video's continuous speech narrative!)

The details of the necessary hardware & software are available here, and Robert says the RoboDance 4 is available now; joining his mailing list here gets you notifications of how to purchase this pretty darn cool robot!

Labels: robots, Skype video call, surveillance, wireless remote speech recognition

Sunday, December 02, 2007

A speech-enabled newsreader, for the Mac!

A news reader that supports speech recognition for the Mac is now available.
It's called MT-NewsWatcher.

It can be downloaded from here.

Labels: speech enabled news reader

Saturday, December 01, 2007

Excellent comments from a reader; & an answer..

One of our readers has posted some very germane comments.

The first segment says:
"I don't know about all the areas your are talking about, but I do know that at least in my little corner of the world, the medical community either records notes and has those notes transcripted by a human using one of those little pedals, or if they are using speech recognition, have mandatory human review for inaccurate translations."

We're Microsoft fans, of course; but as of yet, Microsoft hasn't published any vocabularies. However, Nuance has released Dragon NaturallySpeaking Medical 9, with 14 medical specialty vocabularies covering 60 subspecialties, which is receiving glowing testimonials in speech recognition newsgroups. Nuance says that with training, accuracy of 99% is achievable, and there's good evidence to support that claim.

As to whether or not physicians should investigate if speech recognition can now be trusted, here is a very, very interesting article from the Speech recognition blog, which specializes in covering speech recognition in the healthcare industries. The short answer: Yes; and an overwhelming "Yes"!

Many users of speech recognition don't know and aren't told that accuracy is very closely related to the input mechanism (hopefully a wireless headset!) that's used to give the recognizer speech data. It is mission-critical, actually.

All the training in the world can't overcome an inherently bad voice signal. Remember the age-old adage, "Garbage In, Garbage Out"?
That holds especially true for speech recognition. In the old days, we can remember when speech recognition software shipped complete with a $4.00 (retail) headset, and the software makers just couldn't figure out why their sales stayed minimal.

{** If anyone reading our blog wishes to find a pre-tested, pre-qualified headset or other microphone type for the best possible input performance, please consider contacting eMicrophones.com - the best and most trusted web vendor for speech recognition quality hardware.}

The 2nd segment of our reader's comment reads:
"And while the technology may be advancing quickly, for the common person, they don't seem to use it for much, and if they do use it, say on a cell phone or call center, it's hardly flowing speech recognition and is often a pain or at best, more timely."

How right you are!
An excellent piece about what really goes on with "voice over cell phone" can be found here in a post by a_chameleon, one of the Team members.

As to call center IVR performance, before we comment on that, we've sent emails to Terry Gold and Marshall Harrison, who are without doubt the leading experts on Microsoft Speech Servers and related IVR matters.

The 3rd segment of our reader's comment reads:
"Maybe the reason it's considered as a flop by some is because the hollywood aspect of speech recognition. We expect it to do the things we see in star trek. Those are unrealistic expectations, of course, but they do exist. And the speech recognition industry doesn't necessarily abstain from playing on that hype when seeking funding, seed capital, advertising revenue, etc..... :)"

Our best answer to that is: everyone who uses desktop speech recognition should waste no time migrating to the awesome speech recognition built into Microsoft Windows Vista™. Once anyone's watched the video of Rob Chambers' interview with Dr. Crounse, there shouldn't be any question: the "Star Trek" and "Hollywood" quality speech recognition we've anticipated for years is definitely here now!

Check back tomorrow for more!

Speech recognition - a "Top 10 Flop" says CNET Blogger

Someone's forgotten to tell Steve Tobak, over at CNET, about where speech recognition really has evolved to these days.
He's declared speech recognition one of the "Top 10 Technology Flops"...

Hmm...
Let us count the ways?

Beginning with the medical profession - 40,000 active physician users generating about 18 million lines per month with speech recognition technology, in the US alone.

Or Field Automation... Equipment inspectors, mechanics, insurance claims adjusters, real estate agents, couriers and other highly mobile, hands-on employees are now using embedded speech recognition on portable devices making data entry and lookup faster and cheaper.

Smart Phone users easily searching the web by voice alone, with uncannily accurate and powerful results.

Notwithstanding smart phone capability - over 60% of "ordinary" cell phone users can just say to their cell phones: "Call steve tobak" and that phone number will ring!

Then there's Ford (the car manufacturer). Do the words "Ford Sync" ring a bell? As in "play artist so-&-so" or "Call the office".. look ma, no hands!
And on the subject of cars, there is VoiceBox's new embedded speech-controlled technology that now allows us to control what our car's various features do and when, including our car's radios and navigation systems.

Is it also curious that GM is busily seeking Telematics engineers that can "Provide technical leadership for advanced speech technology development"..??

Let's not forget the major news networks over recent weeks, where one cannot watch in the AM without seeing at least one advertisement for Dragon NaturallySpeaking.

As to "keyboarding" - Would this video help bring us up to speed, a little?
Where Rob Chambers does just about whatever he wants with Windows Vista™ built-in speech recognition?

Moving right along to the US Military and a company called Adacel, building technology allowing military pilots to interact with modern avionics, using simple voice commands, built on the awesome Microsoft ESP platform?
Or the IBM MASTOR speech-to-speech translation systems presently used by the military, in Iraq?
Speaking of IBM, does their joint venture with Cisco Systems to build speech driven self-service kiosks inside banks tell us anything?

And then there is the globally popular SpeechMagic system from Royal Philips Electronics.. which now offers extremely accurate remote speech recognition across a network, in 23 different languages?

But at the end of the day, we ask:
You've called a large company, sometime lately? What exactly do you think you might be doing when you talk back to that computer voice.. pray tell?

We hope our readers will forgive our quasi-diatribe..
But speech recognition is one of the most pervasive technologies around us these days - and it's only getting better!

** We've been told by a very reliable, trustworthy source, since this post went public, that just because a blog may be viewed on the CNET site doesn't automatically mean that it was written by a CNET employee!

Labels: speech recognition