Whatever Happened to Voice Recognition?

I’ll add to the recommendation for “On Intelligence” by Jeff Hawkins. He lays out clearly why he believes AI, the way it is currently being tackled, will never work, and he often uses speech recognition as an illustration. A must-read if you’re fascinated by how the human brain works and what the (potential) future of AI could be.

A fun little personal experience:
I called a friend of mine recently and left a voicemail consisting entirely of me saying “penis” in every imaginable inflection/accent/volume/etc (I’ve known him since high school…this is par for the course with us).

Well, my friend uses Google Voice (I think that’s the name), which takes your voicemail and transcribes it into text for him to read. Somehow, the application translated my message into a bizarre but somewhat logical message from “Denise,” telling him that she would be late, but that it would be nice to meet him. We were absolutely bewildered by this…

I wouldn’t call Graffiti hand-writing recognition.
Jeff Hawkins turned it on its head - he had the user get trained rather than making the machine learn. It recognizes a relatively small set of ‘strokes’ and mathematically determines which is the best fit.

Better to call it limited ‘print-writing’ recognition at best.

I appreciate your thoughts on voice recognition, but you dismiss useful cases on the “fringe” too handily. I’m glad you stipulated that you found valid use for voice recognition given disability; I have a close family friend who makes her living as a writer, and her Parkinson’s disease makes this, a source of her identity, more and more difficult as the disease progresses. Voice recognition software lets her continue to “write”, but as you mention, this software isn’t at our desired Star Trek Quality Level quite yet.

Doctors have long recognized the impact of “quality of life” treatments on overall patient health. For instance, patients with degenerative, incurable diseases like Alzheimer’s can temporarily stave off cognitive decline with simple things like playing Scrabble or socializing.

I hope that, given the massive investment in R&D for complex biological treatments by drug companies, somebody, somewhere, takes the time to consider what might be achieved by devoting significant brainpower to quality of life treatments, like voice recognition software for writers whose bodies betray them. I don’t think of voice recognition as a luxury, I think of it as a healthcare issue. If it spills over to be useful for us in the mainstream, that’s just a pleasant side effect.

Currently I am working on a voice command application for Android and for the most part users find it accurate enough to be useful.

Some things I do to combat poor recognition (Google’s) are treating similar-sounding words, like “hi” vs. “high”, as the same in certain situations, and applying various kinds of fuzzy matching in others. Solutions vary case by case, but there are certainly ways to make that 80% accuracy sting a little less; see the sketch below.
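
For what it’s worth, here’s a minimal Python sketch of the kind of thing I mean. The homophone table and command list are invented for illustration, and difflib’s string similarity just stands in for whatever edit-distance or phonetic matching a real app would use:

    import difflib

    # Invented homophone table: collapse words the recognizer tends to
    # confuse into one canonical spelling before matching.
    HOMOPHONES = {"high": "hi", "two": "to", "four": "for"}

    # Hypothetical command list for a voice-command app.
    COMMANDS = ["read recipe", "next step", "previous step", "stop reading"]

    def normalize(text):
        return " ".join(HOMOPHONES.get(w, w) for w in text.lower().split())

    def match_command(heard):
        # Fuzzy-match the recognizer's output against the known commands;
        # cutoff=0.6 tolerates a misheard syllable without matching noise.
        hits = difflib.get_close_matches(normalize(heard), COMMANDS,
                                         n=1, cutoff=0.6)
        return hits[0] if hits else None

    print(match_command("necks step"))    # -> "next step"
    print(match_command("stop reeding"))  # -> "stop reading"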

For the most part it’s novel, a cool party trick, but echoing what others have said, I have many disabled, impaired, or purely hands-free users who find voice recognition, while flawed, a great way to interact with something they previously could not, with no additional hardware.

The best example of this is YouTube’s attempted transcription of a Glaswegian accent: http://www.youtube.com/watch?v=lt-DWr7208A

I suspect that a big driver behind speech and handwriting recognition was simply that 15-20 years ago there were a lot of adults in their 40s who had never learned how to type, much less use a mouse. I would rather click on a column and sum it, but that’s because I grew up learning how to do this quickly and accurately.

White collar professionals at the peak of their careers in the late ’80s were often told that typing was a skill for secretaries and that all they had to do was give dictation. The secretary would do the “speech recognition”.

Have you tried Windows 7’s handwriting recognition? It’s pretty damn accurate!

Voice recognition still sucks in Windows 7 but after training it’s usable.

How ironic that most people use their iPhones, Blackberrys, Androids, et al for anything BUT verbally communicating. Using a telephone to actually SPEAK with another person? How quaint.

The biggest problem with vocal interaction with computing devices may simply be that we’re infatuated with the romantic Hollywood versions of it. Just as programming seems so dynamic and cool in the movies, we’re not prepared for how tedious and laborious the real versions of these interactions usually are. Given that Star Trek’s teleportation scheme was just a creative way for the show’s producers to circumvent budgetary limitations, I suspect that voice recognition was likewise more a means of quickly moving the story along than an actual depiction or prediction of a useful future technology.

On the handwriting tangent, ya hafta wonder why Jobs & Co. didn’t build handwriting recognition into the iPad. Is it that he simply didn’t see the need to compete with tablet PCs, that he didn’t believe it would be successful enough, or that he views handwriting as a useless and all but dead technology?

Here is some official information about IBM’s Watson.

IBM BlueGene/P (a.k.a. “What is Watson?”)

http://www.research.ibm.com/deepqa/

People who know me also know I am very passionate about sci-fi. So when I saw this today I just couldn’t help thinking about HAL in "2001: A Space Odyssey" as well as Skynet in the various Terminator movies.

Neil Rieck
Kitchener / Waterloo / Cambridge,
Ontario, Canada.
http://www3.sympatico.ca/n.rieck/

Last year I had a problem with my right wrist and was unable to use my hand for 10 days. The first thing that came to mind was searching for voice recognition software, only to find out that, for practical purposes, it simply doesn’t exist nowadays.

I agree that for non-disabled people it wouldn’t be the most productive way to use a computer, but it would be excellent for people who are disabled in some way.

I mean, except for thought recognition, which would be much better.

I don’t know about voice recognition, but handwriting recognition is a very significant variant of Chinese input systems. Without it, I couldn’t imagine how my Dad could ever type the Chinese names of his friends, colleagues and relatives into his mobile phone.

The art of Chinese typing on a 105-key keyboard, let alone a 12-key numpad on a phone, still has quite a learning curve. Chinese handwriting recognition systems have therefore been actively developed and improved over these decades, bridging the gap for the elderly. I remember that when I was small, the recognition rate of the computers in public libraries was quite low. Nowadays it is a lot more satisfactory.

I think you are missing the point. It’s not about speech recognition errors. It is about finding the appropriate application for speech recognition, such that users will tolerate an error here or there. It has to be the kind of application where the user gains a huge benefit from using speech recognition.

My Android app seems to be a good example: 1250 people used speech recognition to have their recipes read out loud. I think they just don’t want to get their million-dollar phones sticky; isn’t that worth sitting through a few speech recognition errors?

Speech recognition is still alive in this app: www.digitalrecipesidekick.com

For general-purpose (unconstrained) voice recognition, the best thing we could hope for (without real AI) is probably about the same level at which an Aspie/autist understands verbal communication; i.e., very literal. This might be good enough for cheap closed captioning, but not for making a more intuitive user interface (at least not for a non-geek). For geeks (closer to the autism end than the normal end of the spectrum) it might soon be good enough for some usage… :wink:

The real problem is that if the speech recognition requires a fair amount of training, a tactile interface is likely to be easier to learn to use quickly.

Dictation using a human typist is not faster (from thought to final written text) than typing it yourself if you’re a good typist. Though a good stenographer may manage to extract the essence of a brainstorm and write it down faster than an amateur would, it’s very unlikely that an unintelligent machine would ever be able to do so…

Hi Jeff, I have pretty strong opinions about this (I guess a lot of people do, but I normally don’t have strong opinions about many things).

  1. Not nearly enough attention is paid to the argument for people with disabilities.
    With an estimated 75% of the population having some sort of disability (I don’t have the stats on me, but check out the whole chapter in “Don’t Make Me Think V2”), it’s not so much about whether YOU would “want to extend the creaky, rickety old bridge of voice communication” to anything, but whether or not it would benefit the world as a whole. 75% of the population is a lot of people - the potential majority.

Look no further than your own brainchild, SO - for developers, it’s not perfect, but people need it and “it works”: http://stackoverflow.com/questions/87999/voice-recognition-software-for-developers

  2. Most reviews of this type of software are full of… it.
    How is it that anyone who ‘reviews’ voice recognition or handwriting recognition spends a whole 5-10 minutes training/using it and then calls it crap? I get that some people have the attention span of a goldfish, but this software isn’t magic (yet). The reports of voice and handwriting recognition being crap are highly exaggerated.
  • I write software that relies on Windows XP Tablet PC Edition, and the handwriting recognition works, provided you don’t write like a slob. It’s not perfect and it doesn’t have built-in learning yet, but when it does, it’ll only get better.
  • Dragon NaturallySpeaking, once trained, works VERY well. The CEO of my company is functionally blind. He uses a screen reader with MS Mike and Mary, as well as Dragon, to great success. He dictates about 30-40 e-mails a day in between everything else he does. I’ve found errors in maybe 5% of his e-mails to me. Take a look at your inbox: that’s likely better than most people’s typing.

It’s worrying that, given your public voice, people might take what you say here as a truthful indication that these technologies don’t work, and aren’t worthwhile. They do, and they are.

One common technique sometimes seen portrayed in hard science-fiction is subvocalization. Basically, your brain thinks about speaking, your body starts to go through the motions of forming speech, but no actual detectable noise comes out. The general idea in the fiction seems to be that the computer doesn’t directly attempt to transcribe the sounds of the speech as such; instead, it reads the muscle twitches or nervous impulses or brainwaves or something to determine what you were trying to say.

At least it’ll be quieter down at the local Starbucks.

I just want to second that the handwriting recognition in Windows 7 is, in my opinion and experience, quite nice. With absolutely no training on my HP TM2, it reads both my print and cursive with maybe one letter error per sentence. But by far the best part is the interaction after you’ve written something. Completely intuitive. Don’t like a word? Draw a line through it, and it’s gone. Got a letter wrong? Tap the word, and it explodes out by letter. You can then insert, delete, or overwrite individual letters.

Its only drawback is that it is dictionary based, leading to lots of corrections for words which, well, aren’t really words.

I started using voice recognition almost 20 years ago, due to overuse injuries (15 years as a programmer was more than my tendons could cope with). Back then it was discrete speech and required about three hours of training (reading lists of words, then letting it tweak a user-specific model). Now it takes minimal training and allows continuous speech (some systems take samples of your writing style to tune their grammar model).

There are some real advantages. It is a MUCH faster typist than I used to be, and it can spell, something I was never noted for. Its accuracy is good enough to be annoying, at 95-97%. You still have to watch it: it’s rarely wrong, but when it is, your spell checker will be no help finding its mistakes.

By the same token, it’s not much good for a programmer. The problem isn’t the language syntax but the identifiers. The system is geared to insert correctly spelled words separated by spaces, NotMixedCaseMashedTogetherDevisedByTheVwlAlrgc. If it’s just your code, you can cope, picking names that can be easily said, but if you work on large systems, you will never be dealing with only your own code. It’s why I stopped programming and started managing. I could produce code, just not fast enough to satisfy me.
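
To make the identifier problem concrete, here is a toy Python sketch of the post-processing a voice-coding setup has to bolt on just to get from dictated words to a single symbol. The function and style names are illustrative, not any particular product’s commands:

    def dictated_to_identifier(phrase, style="camel"):
        # Recognizers emit correctly spelled, space-separated words;
        # identifiers need them mashed back together in some convention.
        words = phrase.lower().split()
        if not words:
            return ""
        if style == "camel":    # customerOrderTotal
            return words[0] + "".join(w.capitalize() for w in words[1:])
        if style == "pascal":   # CustomerOrderTotal
            return "".join(w.capitalize() for w in words)
        if style == "snake":    # customer_order_total
            return "_".join(words)
        return "".join(words)

    print(dictated_to_identifier("customer order total"))           # customerOrderTotal
    print(dictated_to_identifier("customer order total", "snake"))  # customer_order_total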

Oh yeah, and while some systems provide a means to operate the mouse by voice, they make generating identifiers seem pleasant by comparison. There are head and eye tracking systems, but they aren’t for the able-bodied: if you can move your upper body, the pointer will never stay where you want it.

I mentioned grammar models above; that is the trick with accuracy. That 80% accuracy figure is typical of English without a grammar model. There are too many homophones in the language to do much better. Did you want me to insert “there” or “their”? If you know the 3 or 5 words before, you can make a pretty good guess. If you can delay inserting until you get the following word, you can do even better. When you are modifying a document, you want the system to be able to query the editor and get the words surrounding the cursor, so the editor/browser/whatever has to be built to cooperate.
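
The “there” vs. “their” guess from the preceding word is easy to sketch in Python. The bigram counts below are invented for illustration; a real recognizer scores whole word lattices with much longer context:

    # Invented bigram counts: how often each homophone followed the
    # preceding word in some training corpus.
    BIGRAM_COUNTS = {
        ("over", "there"): 120, ("over", "their"): 2,
        ("in", "their"): 95,    ("in", "there"): 40,
    }

    HOMOPHONES = ("there", "their")

    def pick_homophone(prev_word):
        # Choose whichever homophone the grammar model says is most
        # likely to follow prev_word; unseen pairs count as zero.
        return max(HOMOPHONES,
                   key=lambda w: BIGRAM_COUNTS.get((prev_word, w), 0))

    print(pick_homophone("over"))  # there
    print(pick_homophone("in"))    # their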

One of the real barriers to voice input is the modern open office. You did not want to be in the cube next to a voice input user, especially in the old discrete speech days. Dana Bergen wrote this after some time spent with one of the early discrete speech systems (which also required a wired headset microphone if you wanted any accuracy):
http://www.cl.cam.ac.uk/a2x-voice/a2x-faq.html#Thea2xOlympics

Summary: if you spend your time creating original English text (like most of this post), it will be faster than your typing, and it will improve your spelling. If your hands don’t work but your mouth still does, it will let you rejoin the online world. It’s not the answer to all the computing world’s problems, but no single tool is.

This post was untouched by human hands.

For a laptop or desktop, I’m with you: typing and mousing is a lot more effective than speaking.

But on a mobile phone with a screen that fits in your hand and a tiny software keyboard, typing is a royal pain in the ass. The device is designed to pick up the human voice, so it’d be nice if it could do more than just copy that voice to somewhere else on the planet. Unfortunately, the most frequent and annoying thing I type on my phone is my password, which is optimized for having lots of numbers and punctuation, not being easy to type on a simplified keyboard. And I don’t want to speak that into the phone, even if I could.

For universal speech recognition, we need to do a much better job with semantics. The human perceptual system has an impressive amount of interaction between high-level and low-level language processing for disambiguation, filtering, prediction, etc. If you proposed the human speech system to a software architect, the high coupling, lack of clean interfaces, and legacy spaghetti code would drive him nuts.

Still, I can’t help thinking that the technology, properly integrated with a rigorous grammar model such as exists in programming languages and IDEs, would be hugely accurate and successful.

I believe that all the technology is ready NOW for a near-perfect implementation integrating voice recognition for, say, Visual Basic .NET programming. This seems especially feasible with the addition of Visual Studio’s IntelliSense rules.
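
To make that concrete, here is a minimal sketch (in Python rather than VB.NET, for brevity) of the idea: rescore the recognizer’s N-best guesses against whatever identifiers the IDE says are legal at the cursor. The hypothesis list, scores, and symbol table are all invented for illustration:

    # Invented symbol table: what IntelliSense-style context says is
    # legal at the cursor right now.
    IN_SCOPE = {"customerTotal", "customerName", "Console", "WriteLine"}

    def rescore(nbest):
        # nbest: list of (acoustic_score, token) pairs from the recognizer.
        # Prefer tokens the editor says are legal here; among those, trust
        # the acoustics. Fall back to raw scores if nothing is legal.
        legal = [h for h in nbest if h[1] in IN_SCOPE]
        return max(legal or nbest, key=lambda h: h[0])[1]

    hypotheses = [(0.41, "customer toad ull"),  # raw transcription, not legal
                  (0.38, "customerTotal"),
                  (0.21, "customerName")]
    print(rescore(hypotheses))                  # customerTotal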