The rise and fall of voice

Voice-controlled computers are the dream of many, but they remain stubbornly in the realm of science fiction. Apple’s Siri assistant and Google’s Voice Search are first steps towards making these dreams a reality, but it’s one thing for your phone to read out your schedule, and another thing entirely for your computer to understand what you’re asking.

Transcribing words

Current voice recognition systems use statistics to turn sounds into words. Once the sounds have been transformed into a list of characteristic properties by a signal processor, the list is fed into two statistical models. The acoustic model tries to match sounds to the units of speech that form individual words – if a word begins with a certain sound, the next sound must be one of the handful that could complete it. The language model works in a similar way, looking at which words are likely to occur next to each other. Both models rely on the analysis of masses of text from books and newspapers, along with hours of voice recordings, which has identified certain pairings and patterns as more probable than others.
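To make the two-model idea concrete, here’s a minimal sketch in Python with invented scores and a deliberately tiny vocabulary: an acoustic score says how well each candidate word sequence matches the sounds, a bigram language model says how plausible the words are next to each other, and the recogniser simply picks the candidate with the best combined score. It illustrates the principle only, not how any production system is built.

```python
# Toy decoder: acoustic and language model scores are combined (log
# probabilities add) and the candidate transcription with the highest
# total wins. All numbers are invented for illustration.
import math

# Acoustic model: how well each candidate word sequence matches the sounds.
acoustic_scores = {
    ("recognise", "speech"): 0.60,
    ("wreck", "a", "nice", "beach"): 0.55,
}

# Language model: how likely each word is to follow the one before it.
bigram_probs = {
    ("<s>", "recognise"): 0.010, ("recognise", "speech"): 0.200,
    ("<s>", "wreck"): 0.002, ("wreck", "a"): 0.100,
    ("a", "nice"): 0.050, ("nice", "beach"): 0.010,
}

def language_score(words):
    """Sum of log bigram probabilities along the word sequence."""
    return sum(math.log(bigram_probs[(prev, word)])
               for prev, word in zip(("<s>",) + words, words))

def total_score(words):
    """Acoustic evidence plus language evidence, in log space."""
    return math.log(acoustic_scores[words]) + language_score(words)

best = max(acoustic_scores, key=total_score)
print(" ".join(best))  # -> recognise speech
```

The classic pair “recognise speech” / “wreck a nice beach” shows why the language model matters: both candidates match the audio almost equally well, but one is a far more probable string of English words.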

Not surprisingly, background noise, accents and poor sound quality all adversely affect the accuracy of voice recognition systems, but having more data won’t necessarily help. “The performance improvements people report at conferences every year are matters of fractions of a per cent on incredibly hard tasks,” says Roger Moore, professor of spoken language processing at the University of Sheffield.

Another big challenge is adapting to different surroundings and contexts. In particular, the fast, conversational speech used in casual situations is hard for computers to understand – speakers merge words into each other, leaving computers trying to make sense of incomplete sounds. Again, it’s not simply more data that’s needed. “The more important and more interesting thing is using the data in a smarter way and getting better acoustic and language models,” says Steve Renals, professor of speech technology at Edinburgh University.

Systems like Google Voice Search and Dragon Dictation essentially transcribe sounds into words, but getting a computer to understand and act on what you say is a very different and more complex task. In certain limited scenarios voice control can work well. “If you can limit the domain or topic of conversation then you can do much better than in an open domain, discussing philosophy for example,” says Roger.

Tuning in to our feelings

Researchers are turning to emotions to improve the way computers understand us – a field called affective computing. Studies have shown that people who have suffered damage to the part of the brain that controls emotion make decisions that resemble those of an artificial intelligence, so by building emotional factors into their programming, computers should make better decisions.

Getting emotional data from speech is similar to transcribing words, but it’s more complex as there are more variables to monitor. The speech rate, the length of pauses, changes in volume or pitch, richness (think “choking up”) and the level of sound filtering by body parts are all affected by the emotions of the speaker – there are thousands of signals to measure.
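As a rough, hypothetical illustration of what measuring those signals can involve, the Python sketch below pulls a few simple prosodic features – pause ratio, loudness and its variation, and a crude proxy for speaking rate – out of a raw waveform. The frame size, silence threshold and feature set are arbitrary choices for the example, not a standard recipe.

```python
# Toy prosodic feature extraction: chop a mono waveform into short frames,
# measure loudness per frame, flag quiet frames as pauses, and derive a few
# of the signals mentioned above. Frame size and threshold are illustrative.
import numpy as np

def prosodic_features(samples: np.ndarray, sample_rate: int) -> dict:
    frame_len = int(0.025 * sample_rate)            # 25 ms analysis frames
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)

    energy = np.sqrt(np.mean(frames ** 2, axis=1))  # loudness (RMS) per frame
    silence = energy < 0.1 * energy.max()           # crude pause detector

    return {
        "pause_ratio": float(silence.mean()),        # fraction of time spent silent
        "mean_energy": float(energy.mean()),         # overall loudness
        "energy_variation": float(energy.std()),     # how much the volume swings
        # speech-to-pause transitions: a very rough stand-in for speaking rate
        "speech_rate_proxy": int(np.diff(silence.astype(int)).clip(min=0).sum()),
    }

# Example: one second of synthetic "speech" - alternating noise bursts and gaps.
rate = 16000
bursts = [np.random.randn(rate // 4) * (i % 2) for i in range(4)]
print(prosodic_features(np.concatenate(bursts), rate))
```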

And it’s not just the number of signals; everything we say carries some sort of emotional signal, and it’s hard to separate them out. “In real life, our voices and our faces are emotionally coloured. They don’t go into nice discrete boxes. People have levels of emotions which rise and fall,” says Roddy Cowie, professor of psychology at Queen’s University Belfast.

Levels of emotion are also open to interpretation. To try to get round the subjective nature of classifying emotions, current research uses a panel of humans to judge what people are feeling while they interact with computers. These judged emotional levels are then compared with the signals measured, to improve the system.
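A hypothetical sketch of that comparison step might look like this: the judges’ ratings for each recording are averaged into a consensus emotion level, and the system checks how closely one of its measured signals tracks it (here with invented numbers and a simple correlation).

```python
# Toy comparison of a measured speech signal with human panel judgements.
# Ratings and feature values are invented; the point is the shape of the
# comparison, not the numbers.
import numpy as np

# Each row: one recording, rated by three judges for emotional intensity (0-10).
panel_ratings = np.array([
    [7, 8, 6],
    [2, 3, 2],
    [5, 5, 6],
    [9, 8, 9],
])

# A signal measured from the same recordings, e.g. how much the volume swings.
energy_variation = np.array([0.42, 0.11, 0.25, 0.55])

perceived = panel_ratings.mean(axis=1)               # consensus emotion level
corr = np.corrcoef(perceived, energy_variation)[0, 1]
print(f"agreement between measured signal and judged emotion: {corr:.2f}")
```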

The uncanny valley

Unfortunately, even if computers do get to the point where they can understand both our words and how we say them, it might not be a good thing. There is a hypothesised point at which interactions between humans and robots reach a level of realism that becomes uncomfortable and disconcerting. “As systems get more human-like you enter what is called an uncanny valley, where robots are human-like enough that you expect them to have other human-like properties, and if they don’t, they seem weird and frightening,” says Roddy.

So perhaps the question should be less about when we will be able to create computers that can draw on all the experiences and knowledge amassed over a lifetime when holding a conversation, and more about whether we should be heading down this route at all.
