Whatever Happened to Voice Recognition?

“In my experience, speech is one of the least effective, inefficient forms of communicating with other human beings.”

You might be forgetting just how much information we can transmit through so few words, with things like tone, inflection, etc. The issue with language is that each message is not self-contained; on the contrary, it depends fundamentally on the information each party already has. So I think the problem here is not the inefficiency of language but rather the immense difficulty of digitally recreating a system of such complexity.

Of course, one reason for using speech recognition is to reduce the learning curve for computer use. But we’ve seen how this level of intuitiveness can be reached without speech recognition in software like the iPhone OS.

Jeff, I see you read and quoted from the Robert Forstner piece (or peas). Did you read the comments on that piece?

Several gentlemen actively working in NLP took serious issue with his claims about the field flat-lining, specifically criticizing the study that claimed 80% accuracy. I quote: “State of the art wide coverage parsers are currently sitting around 88-95% accuracy, not 80%, with >99% coverage (meaning a successful, though possibly incorrect, parse of 99% of unlabelled unrestricted text).”

Consider taking another look and possibly involving some additional sources - you have a widely-read blog, and it would be a shame to pass on misinformation.

See also: http://www.reddit.com/r/programming/comments/bzbdf/rest_in_peas_the_unrecognized_death_of_speech/

While accurate, speaker-independent voice recognition with no constraints on vocabulary or context is still a long way off, there have been enough improvements over the last 10-15 years to make speech input genuinely useful in more specific scenarios.

I’ve been working on voice control for Windows Media Center for the past couple of years, and find it works very well indeed. The key benefits I’ve found are:

  • The ability to choose a single musical artist, film, TV program, etc. from a decent-sized media collection that may include thousands of alternatives - without needing to drill down through menus.
  • The ability to issue commands without interfering with onscreen action (e.g. changing the music while a slideshow is running)
  • Instant access for things like jumping to a particular point in a movie ("skip to 47 minutes")
  • An audio input device that only listens when users are issuing speech commands, rather than trying to make sense of all the random sounds it hears (we use an accelerometer to intelligently unmute the mic when needed)
  • The alternative, for our target audience, is a normal remote control; most living room users don't have a keyboard & mouse conveniently to hand for controlling their TV experience
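The first benefit in the list, picking one title out of a large but closed collection, is easier than open dictation because the recognizer's text only has to land near a known entry. A minimal sketch of that snapping step, assuming the recognizer hands back a rough text string; `match_title` and the sample library are hypothetical names for illustration, not part of any Media Center API:

```python
import difflib


def match_title(heard, library, cutoff=0.6):
    """Return the library entry closest to the recognizer's text, or None."""
    # Compare case-insensitively, but return the original spelling.
    lowered = {title.lower(): title for title in library}
    hits = difflib.get_close_matches(
        heard.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


# Hypothetical media library; a real one might hold thousands of entries.
library = ["Miles Davis", "Led Zeppelin", "Daft Punk", "Nina Simone"]

# A slightly garbled transcript still resolves to a known artist.
print(match_title("lead zeplin", library))
```

The `cutoff` value is a tuning knob: set it too low and unrelated speech triggers a match, too high and near misses are rejected.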

During our development, it became very clear that the single most important thing needed for good speech recognition is a high quality microphone system. Most PC mics are lousy for this (limited bandwidth, etc.), which makes the voice recognition engine work much harder. Garbage in, garbage out. (Bluetooth is even worse, due to limited bandwidth.)

I think this is why most users who dabble with speech recognition find it generally poor, even though the quality of, say, Windows 7’s built-in speech recognizer is actually pretty decent.

It will be interesting to see how much Microsoft’s Kinect (aka Project Natal) pushes the speech-in-the-living-room experience forward: with 3D cameras that can accurately identify a speaker’s position in the room, coupled with an array microphone that can focus on that position, the potential is there for very good recognition.

And another thumbs up for Jeff Hawkins’ book On Intelligence, mentioned by an earlier poster: well worth a read for anyone interested in all types of machine recognition.

A very good summary, but there are some things which should be pointed out.

  1. The graph is from the Sphinx Open Source recognizer from CMU.
  2. It is from 2003.
  3. Sphinx today is 10 years behind IBM, Nuance, Google, and MultiModal (and 6 years behind Microsoft).
  4. You are proposing many strawmen and burning them. What about the places where speech works?

The truth is, depending on the domain, many applications have crossed the 4% WER barrier. You just don’t notice anymore. Did you know that all live closed captioning is dictated? Captioners ‘respeak’ what is happening and use macros. The majority of medical transcription is done with computers, and then corrected by correctionists (mainly formatting and billing code extraction/confirmation.)
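For anyone unfamiliar with the metric, WER (word error rate) is conventionally the word-level edit distance between the reference transcript and the recognizer's output, divided by the number of reference words. A minimal sketch under that standard definition; the function name and sample sentences are illustrative, and real scoring tools also normalize punctuation and numbers:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the output
                curr[j - 1] + 1,     # extra word inserted in the output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five is a 20% WER.
print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))
```

By this measure a 4% WER means roughly one error in every 25 words. Because insertions count as errors, WER can exceed 100% on very bad output.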

Ever call 411? “State and city, please?” That is speech recognition presenting a limited phone book to the operator. Google Grand Central has drawn rave reviews.

If you still think Speech Recognition does not work, download the Dragon Dictation or Search apps for the iPhone/iPod/iPad. They are free. Try them out with some full, hard sentences. Don’t just try single words, or things you would not normally say. Try it for real. Then decide.


@Eddy: Kinect will only react to keywords, nothing more.

I always wondered about the interest in speech recognition. Handwriting recognition I can understand - it’s just easier taking notes on a pad than on a keyboard (at least to me). The only application I could see for speech is when you’re pacing around the room brainstorming, but then the problem becomes that you’re pacing and brainstorming, neither of which tends to produce clear speech.
The best thing they could do is couple this with language recognition, to recognise patterns and nonsensical sentences. So you have a certain probability for which word it could be, and you weigh that against which word makes the most sense in context. But that’s a whole different issue we haven’t grasped yet either: context-sensitive language.
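That weighing can be sketched as a rescoring step: each word slot carries the recognizer's confidence per candidate, and a table of word-pair likelihoods supplies the context. This is only a toy illustration; every number below is made up, and `best_sentence`, `slots`, and `bigram` are hypothetical names rather than any real recognizer's API.

```python
import itertools
import math

# Made-up acoustic confidences: how much each candidate "sounds like" the audio.
slots = [
    {"turn": 1.0},
    {"write": 0.55, "right": 0.45},
    {"hear": 0.52, "here": 0.48},
]

# Made-up likelihoods of one word following another in ordinary text.
bigram = {
    ("turn", "right"): 0.30,
    ("turn", "write"): 0.01,
    ("right", "here"): 0.20,
    ("right", "hear"): 0.01,
    ("write", "here"): 0.05,
    ("write", "hear"): 0.01,
}


def best_sentence(slots, bigram, lm_weight=1.0, unseen=1e-3):
    """Pick the word sequence with the best combined sound + context score."""
    best, best_score = None, -math.inf
    # Exhaustive search is fine for a toy; real systems use beam search.
    for words in itertools.product(*slots):
        acoustic = sum(math.log(slot[w]) for slot, w in zip(slots, words))
        context = sum(
            math.log(bigram.get(pair, unseen)) for pair in zip(words, words[1:])
        )
        score = acoustic + lm_weight * context
        if score > best_score:
            best, best_score = words, score
    return list(best)


# Sound alone picks the homophones that scored slightly higher.
print([max(slot, key=slot.get) for slot in slots])  # ['turn', 'write', 'hear']
# Adding context flips both choices.
print(best_sentence(slots, bigram))                 # ['turn', 'right', 'here']
```

Setting `lm_weight` to 0 reproduces the sound-only result, which makes it easy to see how much work the context table is doing.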

Guys, the reason a machine cannot recognize speech well is a linguistic one: the human speech recognition mechanism is itself something of a mystery.

“Speech recognition, no matter how good it might become, will never work for all aspects of human-computer-interaction.”

Insert any other form of input in place of “Speech recognition” and that statement is still true. Keyboards are not very helpful in image manipulation, the mouse/trackball is terrible for word processing and an Xbox controller is not so hot for playing Guitar Hero. Each has its strengths and weaknesses, and speech recognition is no different.

Actually the big difference is that each of those inputs has a very limited amount of information that needs to be processed… a keyboard only has so many keys, a mouse has X/Y coordinates and a few buttons, and a game controller has buttons, a control pad, analog stick, etc. But a voice is an audio stream of data that must be recorded, background noise filtered out, words must be parsed out, and without actually understanding the language it’s impossible to differentiate between homophones. Considering the niche application of voice recognition, not a lot of people are concerned about solving those problems when other interfaces can be munged to work well.

I think it’s weird that everyone feels the need to slam a technology as useless because it’s not universally useful. Heck, even Star Trek, which is used as the quintessential example of voice recognition, still had sophisticated textual/graphical interfaces that you didn’t talk to.

Speech Recognition is the technology of the future, and always will be.

Just out of curiosity I turned on voice recognition in Windows 7. I skipped the training. Next I read the poem. It got all but one word correct. This was using the built-in microphones on my laptop in a fairly quiet environment. Last time I tried this I don’t remember it being nearly as good of an experience.

I think general voice recognition can be very sensitive to the particular user’s voice. I use Google Voice and I find that the transcript is really hit and miss; however, it is hit and miss depending on the speaker. For some callers it is very close to 100% every time that particular caller leaves a message. For others, it can be very bad every single time.

Improvements in voice recognition have been slow, but not nearly as slow as suggested, I think.

Dear aunt, let’s set so double the killer delete select all

I use Google voice recognition all the time, both Goog411 and the Android voice recognition. Both are remarkably effective and useful, and the pervasive availability of voice input is (IMO) a killer feature of Android phones.

I’ve heard the reason Google gives away the Goog411 service for free is that it lets them acquire a massive, user-corrected training set for their VR algorithms. It seems to be working. I’ve had Goog411 handle names (“Le Boulanger Café”) that human operators failed at.

It’s in the car where it really seems to have found its niche. Hands-free voice activated dialing and Bluetooth over the car system means not having to look away from the road, take your hands off the wheel, or even touch your phone to make a call.

Or setting the destination for the nav system while driving. Or even the radio. And so on.

But still I get frustrated with my own car’s system. Partly due to some design flaws that cause minor annoyances. And partly due to it sometimes just not understanding what I’m trying to tell it. Even with an extremely limited vocabulary and command set, and even with the ability to train the system to my own voice, it still manages to get it wrong some of the time.

So yes, there’s a long way to go. But I think it can prove to be a wonderful tool for some limited applications like that. Basically anyplace where your hands and eyes might be tied up doing something else.

Although the post is mostly about communicating with computers, you do mention the drawbacks of trying to verbally communicate with other people (via a computer). The benefit of verbal communication, and the main reason it is used, is that it offers up the possibility to use your hands for something other than communicating, while you’re communicating. That means you can do something -and- communicate at the same time. If you only used your hands, you’d have to choose between either doing something or communicating. In areas such as gaming this makes a huge difference, because gaming doesn’t allow for the breaks in action needed to type something to our fellow gamers. Typing would also be too slow, and probably too inaccurate when done in a hurry, to be efficient enough. Also, you wouldn’t have the time to look at the chat window (wherever it might be on the screen) because you have to focus on something else entirely. Here verbal communication offers a necessary complement to typing. And you are right, you do need to be rather precise, but it really doesn’t take long to master the proper commands if it’s in a field that interests you.

Can’t live without it on my Android phone, and it’s pretty decent on Windows 7 too. “Whatever happened”? - it got good, and people use it as their preferred option every day.

the baseline human voice transcription accuracy rate is anywhere from 96% to 98%!

I bet the numbers will go down if the listeners are blindfolded and are asked to listen to random conversations. From the software point of view, it’s an unfair competition.

I find it funny when “futurists” try to incorporate an age-old broken communication medium into current technologies. I can’t figure out why restaurant drive-thrus still have a speaker and microphone! Haven’t they screwed up enough orders to figure out that the system is flawed? Even humans can’t communicate effectively with speech, which is why we are constantly saying “Come again?” and “Say what?” to each other like idiots. So how can we expect a computer to figure it out?

By the way, who had the crazy idea to incorporate text messaging into cell phones, and why? Seems to me like older people struggle to understand this more than younger folks. “Why would you ever want to type when you can just talk?!” they proclaim. Because some people, at some times, would rather press a hundred letters instead of saying a dozen words. The messenger can get their point across without drawing attention to themself. The message is sent when the sender wants it to be sent, and it is read when the reader wants it to be read. What’s wrong with that?!

Text/typing is also flawed, which is why it bothers me to still see things like voice/hand recognition being discussed. We would be better off thinking about the next NEW method of communication instead of trying to conjure some hybrid of past failures. This was a good post, Jeff!

IBM Develop World’s First Artificial Game Show Contestant - And He Cleans Up REAL Well


I can imagine stackoverflow.com understanding my commands, so I could answer questions a lot quicker. If not for me, then hopefully for the next generation of developers. Something “not” similar:

Me: Yes

Me: :) :) :)

I suggest you try Android voice recognition. Not sure if it goes above 80% accuracy, but it’s quite good. Voice recognition is very neatly integrated: any text field of any Android app can be talked into.

I realize you didn’t create that graph, but I’m disappointed you would reproduce it given that it is highly misleading in several ways…

  1. It uses the good old trick of manipulating the y axis to suit whatever interpretation of the data is wanted. Replot that data on a linear instead of logarithmic scale and voice recognition software suddenly doesn’t look so dismal.

  2. The data set shown here in red was cherry-picked from several found on the original NIST plot. When the number of words that need to be recognized is fixed and limited (such as commands that can be issued to a computer), voice recognition performs quite well.

  3. The last data point on the plotted set is from over 8 years ago. Lack of data since then in no way implies that no progress has been made.
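To make the first point concrete, here is a rough sketch of how far up the y axis the same error rate is drawn under each scale. The axis ranges (0-100% linear, 1-100% logarithmic) are assumptions for illustration, not values read off the original plot, and the function names are made up:

```python
import math


def linear_height(wer, lo=0.0, hi=100.0):
    """Fraction of the axis height at which `wer` is drawn on a linear scale."""
    return (wer - lo) / (hi - lo)


def log_height(wer, lo=1.0, hi=100.0):
    """Fraction of the axis height at which `wer` is drawn on a log10 scale."""
    return (math.log10(wer) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))


# Arbitrary sample error rates, in percent.
for wer in (4, 10, 50):
    print(f"WER {wer:>2}%: linear {linear_height(wer):.2f}, log {log_height(wer):.2f}")
```

Under these assumed ranges a 4% error rate sits about 30% of the way up a logarithmic axis but only 4% of the way up a linear one, so the same curve looks much closer to zero when replotted linearly.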