Whatever Happened to Voice Recognition?

Remember that scene in Star Trek IV where Scotty tried to use a Mac Plus?


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2010/06/whatever-happened-to-voice-recognition.html

Speech increases human effort, and people are too lazy to speak. My bet is on 3D gesture recognition. Imagine I see a new word on screen, draw an imaginary circle around it with my index finger, and drag and drop it (in 3D, without touch) to a new tab. That is something I can do in a second, and it feels powerful.

Maybe we need hand-waving computer control like in Minority Report, and have offices full of people waving their hands instead. It would be more fun to watch.

You forgot the obligatory link to the Bill-Gates-On-Voice-Recognition-Memorial-Page: http://mpt.net.nz/archive/2005/12/30/gates

According to him, it will come to a PC near you in two to three years, since 1997.

In the USA, isn’t voice control already popular in IVR systems?

Over here in the UK, we mainly still use push-button IVRs (“For accounts, press 2”). I’m not sure why. It might be partially due to cultural issues, and perhaps partially due to the wide range of different accents within a relatively small population.

I remember Graffiti on my Palm IIIx; I could write pretty fast with it. Even so, the onscreen keyboard on my Samsung Tocco Lite is much easier, even if I do feel a bit fat-fingered on it sometimes.

Google rewrites classic literature by the brilliant Bakery.

It’s interesting to note that YouTube mangles subtitles far less for the American accent.

The handwriting recognition on my dad’s old Windows tablet was pretty decent though. Also, when I get voicemail by Google Voice, the speech recognition is pretty nice - except when the voicemail is in a different language. Then it’s just funny.

I’m totally with you on the idea that it wouldn’t be that useful, even if it did work. I think folks get excited about the idea because it seems very natural, ergo it must be easier than the current forms of control we have.

But it depends on the domain. Would you want to control a car’s steering via voice recognition? I doubt it. Speech recognition for cars would be great if you could get in and say “Go to work”, and it drove you to work, but the magic there is automatic driving: a “go to work” button wouldn’t detract from the experience one bit, other than making the demo seem less magical.

I’m much more excited about reducing and removing interfaces, rather than putting a boil-the-ocean amount of work into making them different.

Voice recognition does seem to work pretty well for booking movie tickets over the phone though.

The voice recognition built into the Android operating system does a very good job, and actually has a purpose - it allows you to set destinations for sat nav and search the web without having to fiddle around with a tiny soft keyboard.

It would be interesting to know whether this tech achieves better than 80% recognition. I’m pretty sure it exceeds that, judging from my own results (English Home Counties [read ‘posh’] accent).
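For anyone who wants to put a number on "better than 80% recognition": one common way to measure it is word accuracy, i.e. one minus the word error rate (word-level edit distance divided by the number of words actually spoken). Whether the 80% figure discussed here was measured this way is an assumption; the function names and sample phrases below are hypothetical and only meant as a minimal sketch.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word was dropped
            insertion = curr[j - 1] + 1  # extra word was recognized
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """1 - WER; can go below zero if the recognizer inserts many extra words."""
    return 1.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    # Hypothetical example: what was said vs. what the recognizer produced.
    spoken = "navigate to the nearest train station"
    recognized = "navigate to the nearest rain station"
    print(f"word accuracy: {word_accuracy(spoken, recognized):.0%}")
```

To check a personal result against the 80% mark, you would read out a known passage, capture what the phone transcribed, and feed both strings to `word_accuracy`.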

"spoken communication puts a highly disproportionate burden on the listener."
I fully agree!
What is funny is that you mention podcasts later…
Personally, I avoid these numerous and surprisingly popular video tutorials or podcasts, partly because I am much better at reading written English than at understanding spoken English (depending on accent, too), because I am French.
And mostly because most of the video tutorials I have seen are excruciatingly sloooow: we wait for the cursor to slowly move to a menu, hesitate, move elsewhere, etc. And I don’t have sound at work anyway.
It is much faster to read a tutorial on a Web page, to print it out to read on the public transport, to skip some parts I already know, etc.
I understand the appeal of video in some fields (e.g. demonstrating a manipulation in an image editor), much less in others (typing code in an IDE!).
But perhaps I am too old-fashioned… :slight_smile:

Beware of falling into the mindset that if one form of input is worse overall than another (voice vs. keyboard/mouse), it can be dismissed out of hand. It is this one-or-the-other mindset that often leads to people taking sides and arguing over which is better. Instead of thinking about how voice control would replace our current ways of using computers, try thinking about how it can enhance our experience.

Of course it would be ridiculous to imagine a workplace where everyone is giving basic commands to their computers out loud. In shared spaces like that, you must always consider how it affects others. But at home, that is far less of a concern. Lying in bed, who wouldn’t love to just say, Star Trek-like, “Computer, lights off.”

One other thing I’d like to point out, which gives a very simple insight into how voice can enhance our experience, is that while reading is far quicker than listening, saying something is far quicker than typing, and often quicker than locating and pressing a button on a particular screen.

Even a human typist needs to have the text dictated to him, which isn’t the same thing as plain talk. To be able to actually talk to a computer, it’ll need artificial intelligence, knowledge about you and the context of your conversation, and probably also a camera with software that can understand body language.

A camera also helps to solve the problem of the computer knowing when you are speaking to it, as opposed to other people or computers.

Generally I agree with you. However, the transcriptions of the MIT classical mechanics lectures on YouTube, http://www.youtube.com/watch?v=PmJV8CHIqFc , are quite good. Some others are so-so.

I confess I’m not entirely convinced by the idea that it’s still uselessly terrible. Command interpretation - and, more importantly, comprehension - as seen in SF remains beyond us, I suspect, but the last time I tried dictation software (Dragon NaturallySpeaking) I found that, with some training of the software and some practice on my part to avoid umms and aahs, it was of comparable accuracy and speed to general typing. In 2000, on a Pentium II.

I suspect there is a greater problem though. While we work in noisy, shared environments, or use our home computers with others around while we’re watching the TV or listening to music, dictation as our primary means of input is a fundamentally flawed concept. We have to vocalise our trains of thought to all and sundry, and neither we nor they are likely to be very keen on that. I suspect it’ll end up as another form of assistive technology, like screen readers for the blind at present.

Interesting topic, indeed. Speech recognition, no matter how good it might become, will never work for all aspects of human-computer interaction. It is certainly very useful for physically challenged people, but your example of the room full of people trying to control their computers shows why it won’t work in practice. It is already disturbing enough when somebody next to me starts talking out loud and I wonder if he’s talking to me, only to find out he’s using his Bluetooth headset for a phone call.

Another problem is that human beings can understand “context” and “situations”; computers cannot. So if there are 10 people around me and I start talking, people will know whether I’m talking to one of them, to a group of them, or to someone else. They will know either from where I’m looking and who my eyes are focusing on, or from the context of what I say. How can a computer know whether something is a command for it or talk meant for my coworker? When I say to my coworker “Just go to Google and search for …”, I don’t want my computer to do this; how is my computer supposed to know?
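One crude workaround for the "who am I talking to?" question is to require an explicit form of address, Star Trek style, and ignore everything else. The sketch below assumes the speech has already been transcribed to text; the wake word `computer` and the function name `extract_command` are hypothetical choices for illustration. It does not solve the real context problem (it cannot tell that you are merely quoting a command to a coworker), it only filters out speech that never addresses the machine.

```python
import re
from typing import Optional

WAKE_WORD = "computer"  # hypothetical form of address; any distinctive word would do


def extract_command(transcript: str, wake_word: str = WAKE_WORD) -> Optional[str]:
    """Return the command if the utterance starts by addressing the machine, else None."""
    # The utterance must begin with the wake word as a whole word,
    # optionally followed by a comma or colon, then the command itself.
    pattern = rf"^\s*{re.escape(wake_word)}\b[\s,:]*(.+?)\s*$"
    match = re.match(pattern, transcript, flags=re.IGNORECASE)
    if match is None:
        return None
    # Drop trailing sentence punctuation; treat an empty remainder as "no command".
    command = match.group(1).rstrip(".!?").strip()
    return command or None


if __name__ == "__main__":
    for utterance in [
        "Computer, lights off.",
        "Just go to Google and search for it",
        "Computers are slow today",
    ]:
        print(repr(utterance), "->", extract_command(utterance))
```

Anything said to a coworker without the prefix is simply dropped, which trades natural conversation for predictability.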

My biggest gripe with computers is mice. I use a trackball, which I consider much better, but still not perfect. Touchscreens, touchpads? Don’t like them. Keyboards with a touch-sensitive surface and gesture recognition? Don’t like those either, because they are basically touchpads. My dream is that one day I can have a normal keyboard, with normal keys, for typing, but I can just lift my hands up a bit and make a gesture in the air and the computer will understand it, so I never have to move my hands far away from the keyboard just to move a window to the left, make it bigger, or open a menu. Sure, you could do all this with keyboard shortcuts, but that is not as effective as using a mouse, at least not with the operating systems I have to work with.

I so disagree!

Wouldn’t it be many times faster to click the toolbar icon with your mouse, or press the keyboard command equivalent, to sum the column - rather than methodically and tediously saying the words “sum this column” out loud?

You are trying to eat soup with a fork!

If you could say “Can you please sum column D15 and place the results after the last populated cell, and could you also save the worksheet for me after that” - Then YES, it would be faster than clicking (especially for non-IT folks).

Or if you are filling out a classic user profile page and could simply blurt out your address without having to carefully break it up into Street/Zip/Country, it would be delightful! Of course, if you are going to have to say, Postcode IS xxx, Country IS xxx, then NO, that is TEDIOUS as you point out.

When voice recognition software and hardware mature and allow us to speak as fluently as we do in day-to-day life, THEN voice recognition (and not control) will definitely be more effective than the lower-tech alternatives.

Just because we do not know how to MAKE IT does not mean we do not know how to USE IT.

Funny that the poem line “it gets it wrong… sometimes” failed while “sometimes it gets it right” is recognized successfully…

* typical spoken communication tends to be off-the-cuff and ad-hoc. Unless you're extremely disciplined, on average you will be unclear, rambling, and excessively verbose.
* spoken communication puts a highly disproportionate burden on the listener. Compare the time it takes to process a voicemail versus the time it takes to read an email.

Carefully composed writing is a lost art in many places; lots of people write the way they speak.

The reason that Palm Graffiti worked is that it used the limited-vocabulary approach. However, I think that for handwriting recognition, this is probably the best approach: make the user learn a new alphabet in which all the ambiguity is removed, and which can actually speed up writing, because the letters can be made simpler. It’s much easier to write a letter T or A in Graffiti (http://en.wikipedia.org/wiki/Graffiti_%28Palm_OS%29), because they are reduced to a single pen stroke. My only problem with the Palm was the lack of friction, which gave it an unnatural feel, quite different from writing with a pen on paper.