Whatever Happened to Voice Recognition?

“In my experience, speech is one of the least effective, inefficient forms of communicating with other human beings.”

Jeff, I respect your experience, but I completely disagree with your implied conclusion. Yes, speech is inexact, and people can ramble, but the density of information sent back and forth via the human voice isn’t limited to the words spoken. Tons of subtle cues that we take for granted in day-to-day communication - stress, tone, intonation, pauses, etc. - alter meanings and advance the conversation.

This all gets stripped out in other forms of communication, email for example. I cannot tell you the number of times at work when email kept getting passed back and forth trying to resolve an issue, and a quick phone call ultimately cleared up the misunderstanding in 30 seconds. My rule of thumb now is that if more than four round trips occur on a given issue, I pick up the phone and give the person a call.

Which is easier for a non-technical person to understand? “Tell your computer to open the browser,” or “Click the little icon in the lower left of your screen, open the Programs menu, and find Firefox somewhere in that huge list”?

Now, it’s true that we’d need really good AI in order to interact with a computer naturally. But I think we could improve the user experience today if we let people create custom commands with voice recognition.

For example, you walk into your office. As you cross the room to your desk you say, “Computer, open project control the world.” Then, as you sit down, your computer automatically opens your world-domination plans exactly where you left them. It could even have a voice-print check and, with cameras, facial recognition. Essentially, these are just shortcuts bound to voice commands.

So, it’ll be a while before we can interact with computers like they do on Star Trek. But we could have some useful features today. I wonder why we don’t… Since each voice shortcut would be keyed to a recording, the general error rate should mostly stop mattering.
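To make the “keyed to a recording” idea concrete, here is a minimal sketch, assuming some feature extractor (e.g. MFCCs) has already turned each utterance into a frames-by-features array; the command names, template shapes, and rejection threshold are all invented for illustration.

```python
# Shortcut matching keyed to a recording: instead of open-ended transcription,
# compare an incoming utterance against one stored template per command and
# fire whichever is closest. Random arrays stand in for real features here.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic-time-warping distance between two (frames x features) arrays."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # stretch the template
                                 cost[i, j - 1],      # stretch the utterance
                                 cost[i - 1, j - 1])  # advance both
    return float(cost[n, m])

# One pre-recorded template per shortcut, captured once during setup.
TEMPLATES = {
    "open project control the world": np.random.rand(40, 13),
    "lock the screen": np.random.rand(25, 13),
}

def match_command(utterance: np.ndarray, reject_above: float = 50.0):
    """Return the closest stored command, or None if nothing is close enough."""
    best, best_dist = None, float("inf")
    for command, template in TEMPLATES.items():
        dist = dtw_distance(utterance, template)
        if dist < best_dist:
            best, best_dist = command, dist
    return best if best_dist < reject_above else None
```

Because every utterance only has to beat a handful of stored templates rather than the whole language, a fairly sloppy matcher is good enough.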

When you were researching this, what did you find out that people have been doing to try and solve the problems?

The reason speech recognition doesn’t work is the same reason we don’t have automated generation of novel software, automated debugging, the semantic web or any of a number of other software tools - because they all require approximately-human-level AI to provide the necessary judgement, context, extrapolation, fuzzy logic and culture.

I mean, even if I misheard part of what you said when you mentioned “balloon boy”, I could still make a judgement, based on recent cultural events, that “balloon boy” was what you meant to say. But it takes an incredible amount of processing power to make such a conceptual leap. And computers just aren’t up to the task yet.

The good news: the human-level-AI problem is solvable; it’s just really, really hard. We will solve it someday, and it will render a huge swath of our day-to-day tedium automatable.

But today is not that day, and tomorrow isn’t looking so hot either.

[I]s voice control, even hypothetically perfect voice control, more effective than the lower tech alternatives?

Depends on the situation, I guess. You astutely pointed out that people still regularly leave voicemail, but it takes SO LONG to listen to an actual voicemail (well, compared to reading the same message as text).

I presume that by “lower tech” you mean pushing a button or a key on your keyboard. I think voice recognition can be useful. An ideal application is automated telephone menu systems, but I can also imagine it being incredibly useful for hands-free operation of a phone.

Enunciation and Projection

Most people speak sloppily. They don’t enunciate or pronounce their words properly, nor do they know how to project their voice so that it is consistently clear. Our evolved brains use a lot of pattern analysis, interpolation, and contextual clues to fill in the gaps. While some speech recognition programs have some of these abilities, they certainly don’t have the comprehensive suite of techniques that our brains have both learned and evolved over the years. It will take a while to figure out how they work in an algorithmic sense.

Back to my assertion about people speaking poorly. I once installed Dragon NaturallySpeaking for a customer of mine who had a very outgoing demeanor, so in telephone conversations and in boardroom meetings people could always hear him and understand what he was saying. Yet when he tried this speech recognition program, it clocked in at around 90% recognition, because he wasn’t used to talking to a computer. Oddly enough, when a phone call came in while I was there setting it up for him, he mentally switched over to his proper mode of speaking (still wearing the headset that came with the program) and recognition shot up to 99+%, because he was speaking clearly and enunciating his words properly.

Once he got off the phone, it dropped back to 90%. It was quite funny, but it clearly illustrates that most people don’t really understand how to talk (well, at least not to an emotionless computer).

And well, I guess that’s the real part of voice communication that we’re not talking about here:

HOW you say something is just as important as WHAT you say.

Voice transmissions have a sub-channel that carries emotional information. Computers are currently unable even to detect and process this information. I’ll concede that it is generally irrelevant in speech-to-text processing, but it’s still part of the contextual clues that we, as humans, use to fill in the gaps of what is being said.

I think it’s telling that you chose to use examples that are, in almost every case, more than 5 years old when discussing the utter failure of voice recognition. Run your own test and see, rather than just reporting on others’ reports of failure. Spend 5-10 minutes training the speech recognition in Windows 7 and then try dictating your blog post. I suspect you’ll find, if you make an honest attempt rather than deliberately trying to foul it up, that the recognition quality is very good.

I should also mention here that your thesis is a little confused. Are you attacking voice recognition (which really means identifying whose voice it is), speaker-independent speech recognition (what Star Trek led you to dream of), or speaker-dependent speech recognition (which is what Windows 7 or Dragon Dictate offers)? Voice recognition software is actually quite reliable these days, although most of us have no use for it. Speaker-independent speech recognition is what really sucks, unless you constrain it to a very limited input set, such as numbers for telephone IVR systems.

Ultimately, I suspect that if you had pain when operating a keyboard, you’d start finding alternatives like speech recognition a far more worthwhile investment of your time and energy - and then you’d be pleasantly surprised by how effective it can be.

I’m a coder, and what I need as input is fast recognition of my brain’s motor functions (i.e., typing), not voice recognition.

Handwriting consumes one hand and inhibits you from using the other, which directly hurts your efficiency… Yes, if you can type faster than you can handwrite (which most people can), handwriting recognition is obviously pointless.
Speaking, on the other hand, doesn’t consume either of your hands. The key to understanding the potential of voice recognition is to avoid assuming it will be the single interface to the computer. You can keep working with the mouse/keyboard (or whatever hand-focused interface exists at that time) and, when you want to do something that would be faster without pulling your hands off what they’re doing, you simply talk.

What’s more, people can speak simple commands faster than they can type them or reach them with a mouse, since general-purpose computers can do so much that everything, even the simple things, ends up buried in the UI. The comment that David Reagan made above exemplifies this point wonderfully.

“Wouldn’t it be many times faster to click the toolbar icon with your mouse, or press the keyboard command equivalent, to sum the column – rather than methodically and tediously saying the words “sum this column” out loud?” – I’m sure something similar was once said about toolbar buttons (“wouldn’t it be easier to open the File menu and click Save instead of searching for the tiny little save icon amongst all those buttons up there?”) but we learned pretty quickly that you get used to it and don’t really need to “look” for the button, despite how many buttons there are.

“I suspect that’s still not good enough in the face of the existing simpler alternatives.” – No doubt StackOverflow was not precisely what you wanted it to be when it began. If you had said, “Oh, this isn’t absolutely, mind-blowingly spectacular today, so I think I’ll just give up on it,” where do you think we would be today?

And finally… “In 2004, Mike Bliss composed a poem about voice recognition. […] The real punchline here is that Mike re-ran the experiment in 2008, and after 5 minutes of voice training, the voice recognition got all but 2 words of the original poem correct!” – If this is not a direct and blatantly obvious display of the improvement of voice recognition software, I don’t know what is. Yes, it took 5 minutes of voice training. Now he doesn’t need to do that voice training anymore - it was 5 minutes, once. And if we don’t just give up on voice recognition now, then eventually that will be 1 minute of training, and maybe someday the AI won’t need training and will change its expectations as it gets to know you better.

The point is… There are places where voice recognition helps (as a third interface to the machine: hand/hand/voice) and there are places where it simply doesn’t belong. Just like the computer itself. But if we aren’t willing to understand that, then we’ll never see what it can do. If we had decided that computers were a waste of space and energy when they were the size of rooms, where would we be today?

There are three different problems with very different error rates: recognizing anybody’s structured speech, recognizing one person’s unstructured speech, and recognizing anybody’s unstructured speech.

If you can define a structure around your interactions with the recognizer (http://en.wikipedia.org/wiki/VoiceXML), you can cut down the list of possible matches and really increase the confidence measure of the transcription. We’re using this approach here: http://www.fidelus.com/locator2.html

This works today, and works well. We’ve done demos on a speakerphone at a noisy tradeshow booth with very few issues.
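As a toy illustration of why constraining the input set helps, here is a sketch; `difflib` stands in for a real acoustic confidence score, and the grammar list and cutoff are made up for the example.

```python
# Grammar-constrained recognition in miniature: instead of transcribing
# arbitrary speech, score the raw hypothesis against a small fixed list of
# allowed phrases. With few candidates, even a sloppy hypothesis usually
# snaps to the right one.
import difflib

GRAMMAR = ["sales", "support", "billing", "operator"]  # illustrative IVR menu

def constrained_match(raw_hypothesis: str, cutoff: float = 0.6):
    matches = difflib.get_close_matches(raw_hypothesis.lower(), GRAMMAR,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None  # None -> re-prompt the caller

print(constrained_match("suppert"))  # -> "support"
print(constrained_match("pizza"))    # -> None
```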

You are exactly right that recognizing anybody’s unstructured speech is a very hard problem to solve. That’s why commercial services still use people to do transcriptions when the confidence measure is below a certain threshold. Over time, these services learn the voices of the people who call you frequently.
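The routing logic being described is simple enough to sketch; the recognizer interface and the threshold below are hypothetical, not any particular vendor’s API.

```python
# Confidence-threshold routing: accept the machine transcript when the
# recognizer is confident, otherwise hand the audio to a human transcriber.
CONFIDENCE_THRESHOLD = 0.85  # illustrative value

def transcribe(audio, recognize, human_queue):
    text, confidence = recognize(audio)  # assumed to return (text, 0..1 score)
    if confidence >= CONFIDENCE_THRESHOLD:
        return text
    human_queue.append(audio)  # hard cases go to a person
    return None                # caller waits for the human result
```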

I don’t think Google uses human transcribers, which is why GVoice transcriptions are hilarious.

There are certainly places where I am very grateful to have voice control available to me today.

A big one is while I’m driving. My car is a Ford Focus with Ford’s “Sync” system installed, and I can use it to control my Bluetooth phone and even my stereo by voice. Pressing the voice prompt button and saying “call John Smith at work” or “play artist The Beatles” works with nearly 100% reliability (the exceptions being some of my phone contacts or music artists with very unusual names). Being able to do these sorts of things without taking my hands off the steering wheel, or even looking at a screen, is very nice.

Another place is the transcripts of voicemails provided by Google Voice on my Nexus One. Granted, the transcripts are occasionally TERRIBLE, and rarely 100% accurate, as you’ve noted above. However, as an at-a-glance summary they help me determine how urgent a message is, and whether I need to listen or respond to it right away. Consider the following two (made-up) messages:

Josh, this is forgotten hospital. Your mop’s been in an accident and you need to fling as soon as you can. Please boil 555-555-5555.

Hi Josh, this is Jane. Veal were considering seeing Boy Store Tree at the theater tonight and thought maybe wood like to come too? Please bet me now.

These are made-up examples, but they’re pretty typical of the results that I get (actually Google’s voicemail transcripts are often better than the above, but the worst English voicemails I get look something like these). Yeah, they’re not accurate. Yeah, I’ll probably listen to the message anyway. Still, I can learn a lot about the relative importance and “ignorability” of the message thanks to the unreliable transcript, and I find value in that.

So why in the world – outside of a disability – would I want to extend the creaky, rickety old bridge of voice communication to controlling my computer? Isn’t there a better way?

Just because you can’t imagine the benefits doesn’t mean they don’t exist.

Of course, only an idiot would argue that voice recognition will ever completely replace other methods of human-computer interaction. When we say that voice recognition would be useful, we don’t mean that it is useful for everything. But there are plenty of scenarios where traditional forms of computer interaction just aren’t appropriate.

Sure, desktop spreadsheets are easier to drive with a mouse and keyboard. They’re designed that way. But when we’re doing something else with our hands, it would be really useful to have machines do what we tell them.

Wouldn’t it be many times faster to click the toolbar icon with your mouse, or press the keyboard command equivalent

What… when you’re driving? I don’t think so, Jeff.

I think you’re right to bring up text input and small devices. Early texting on mobile telephones was fiddly and difficult to use, so people invented a short-hand where they didn’t spell anything properly, used the word “loll” all the time and added a message terminator in the form of a smug, smiling, winking face.

So, the solution isn’t to change the technology but the behaviour of the people using it. I imagine a voice recognition future where some new language is invented in order to leverage the technology - an unambiguous vocabulary enhanced by pops and whistles and that noise you can make by sticking your hand in your armpit.

Jeff Hawkins, who created Graffiti for Palm, is co-author of a great book that digs into why current computer technology fails so badly at things the human brain does well, like vision and speech recognition. The book is “On Intelligence”, co-written with Sandra Blakeslee. In it, Hawkins discusses how human (and other) brains use relatively simple decision making, vast access to memory, and abstraction to interpret messy input like speech or vision. He makes a good argument about why current approaches to things like general speech recognition can’t get beyond a certain hump, and proffers a theoretical new approach based on human wetware instead of computer hardware. One of the best books I’ve ever read (okay, listened to) on the subject. More info at http://www.onintelligence.org/index.php

v1: Speech.
v2: Handwriting.
v3: Typewriting.
v4: Computer.

Why ever downgrade to v2 or v1?

Apropos of nothing, when did quoting break? All I see when poster X quotes poster Y and replies is a series of paragraphs formatted exactly the same. Seems to me I remember a few years back that quoted text was automatically italicized. Bug report?

Perhaps it’s just that I have a fairly neutral “Midwest American” accent, but every time that I have used speech recognition since Microsoft’s Speech API 4 in 1998, after training it has been 99% accurate for dictation and 100% accurate for command and control.

My wife, on the other hand, can’t get speech recognition to work properly in any mode, even with training. Even though we are both from the same town about an hour away from Chicago, she can speak with a Chicago accent while I cannot.

Does anyone know how well this works in other languages?
English, unlike many other languages, has very irregular spelling, and on top of that many quite different accents.

For example, words with the same pronunciation but different spellings:
weak, week; sent, cent; sun, son; bye, buy; sum, some; piece, peace; meat, meet; too, two; pears, pairs; weigh, way; rode, road; …

How does this look in Spanish?
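To see why those homophones are hard for a recognizer, here is a toy sketch of context-based disambiguation; the pronunciation keys and counts are fabricated for the example, where a real engine would lean on a large statistical language model.

```python
# Homophone disambiguation by context: all candidate spellings sound the
# same, so the recognizer must pick the one whose neighbours are plausible.
HOMOPHONES = {"mi:t": ["meat", "meet"], "tu:": ["too", "two"]}

BIGRAM_COUNTS = {  # made-up corpus counts, purely illustrative
    ("red", "meat"): 900, ("red", "meet"): 2,
    ("lets", "meet"): 700, ("lets", "meat"): 1,
}

def pick_spelling(prev_word: str, pronunciation: str) -> str:
    candidates = HOMOPHONES[pronunciation]
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((prev_word, w), 0))

print(pick_spelling("red", "mi:t"))   # -> "meat"
print(pick_spelling("lets", "mi:t"))  # -> "meet"
```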

As someone who usability tests IVRs (automated phone systems) for a living, I have to say that speech recognition currently works very well, given the right domain. The same could be said for keyboards, mice, trackpads, touch screens, etc. What constitutes a niche depends on your perspective. You probably physically touch dozens of computers a day, of which only a few have a keyboard and a mouse/trackpad. Just today I’ve used a car, a Playstation 3 (a.k.a. Blu-Ray player), a digital watch/heart rate monitor, an alarm clock, a Palm Pre, a microwave oven, a VOIP phone, and a Mac Pro. Clearly the Mac is a niche in my daily experience.

The interesting thing about speech is that it’s conversational. In our tests, we’ve found that for a good voice app, accuracy above about 75% doesn’t improve the user experience. Why? A good app will prompt you to repeat, often with suggestions on how to improve accuracy. This isn’t burdensome for the user, since it’s the same thing humans do when they can’t understand what you say. Indeed, the target for a speech app shouldn’t be perfect recognition; it should be to land within the ballpark of a human listener, along with an error-recovery script that’s also as good as a human’s.
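In outline, that error-recovery loop might look like the sketch below; `recognize` is a stand-in for any engine that returns a transcript and a confidence score, and the prompts and 0.75 cutoff are illustrative, not from any real deployment.

```python
# A good voice app re-prompts with a hint instead of demanding perfection,
# mirroring what a human listener does when they miss something.
def ask_account_number(recognize, max_attempts=3):
    prompt = "Please say your account number."
    for _ in range(max_attempts):
        text, confidence = recognize(prompt)
        if confidence >= 0.75:  # "ballpark of a human listener"
            return text
        prompt = ("Sorry, I didn't catch that. "
                  "Please say the digits one at a time.")
    return None  # fall back to a human operator
```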

The problem with your critique of speech recognition is that it’s complaining that speech doesn’t work where it’s an inappropriate input mechanism. That’s a tautology. I, for one, am glad I don’t need a mouse and keyboard to operate my microwave oven, car, and digital watch. That doesn’t make them bad technology, just inappropriate for the use case.

IMHO Henry Kuo makes a much better point than anything in the original post. It is indeed a trap to think that if one method is not best overall, it is simply deprecated. It is rare to see such a visionless post from you.

You would not want to use voice rec for coding, but then an all-touch interface wouldn’t work best for coding either. Does that mean touch does not work on a phone? Clearly no. Handwriting recognition is not efficient for writing a full essay, but using a tablet PC as my sketchbook, and then being able to write the name of a particular drawing (so it is findable in a mass of scribbles) before moving on to the next, is still one of the best computing experiences I have. It makes sketching both effortless and functional.

As an earlier commenter mentioned, the voice rec in Win 7 is quite good; probably good enough for a captain’s log, in fact. I suspect it will be pretty good in Kinect too, though of course there’ll be embarrassing videos aplenty, because it is not mission-critical worthy; the “two incorrect words” are that difficult final mile. But it does bode well for the Star Trek future in which we simply tell background computers what we want to happen. If not “tea, Earl Grey”, then at least basic tasks, in which the truly deprecated buttons, like TV remotes and on/off switches, slowly start to fall away.

The last 20 years taught us that imperfect software can still be useful (see any web app). And that some good ideas take a long time to become useful (see tablet computers).

Also, until recently, automatic translation was a joke, for similar reasons. No longer: http://translate.google.com/translate?u=http://www.asahi.com/&sl=ja&tl=en It still has a long way to go, but a computer currently translates Japanese better than most humans, and that can already be useful.

Ironically, even in Star Trek (TNG) they recognized that if you really needed to get something done quickly, the best thing to do was sit Data down at the console and have him type furiously at robot speeds. Make it so!