Talking Sense

Speech is the most popular interface known to man. Long before the written word, man used talk to influence and alter the world around him. Young children rapidly learn it, entertainers do it to amuse, diplomats prefer it to war. People talk to other people, to animals, to plants, to themselves and to God. The main reason humans invented written symbols and language was to have permanent records. Fragments of ancient texts that have survived to the present day state laws and business accounts as often as not. Some objects, like the QWERTY keyboard, become so familiar that we forget they were originally designed to work around technical and physical limitations. If we could talk to machines, then we would. In fact, we often do talk to machines – we just do not expect them to respond. Even the most forward-thinking can forget this; Steve Jobs announced that “voice is the killer app” before showing off his new phone with a touchscreen. He was half-right. Voice is the killer app, but not just because people want to talk to each other. Phones are designed to be talked through. It would be just as natural to design phones to be talked to.

Speech interfaces are not new. Back in 2000, UK cellco Orange acquired Wildfire Communications, and its voice recognition service, for US$142m. That deal was small compared to Nortel’s purchase of Periphonics for US$436m the previous year. But in 2005, Orange terminated its Wildfire personal assistant service due to declining numbers of users. As a consequence, Orange had to manage the uproar from a legion of visually impaired users who relied on Wildfire to make their calls. Wildfire would have been more popular, and would still exist today, if it had worked well for the general public. Talking is great because it is a fast and effective form of communication. You would not email the fire station if your house was burning down. But talking is frustrating and time-consuming if the person you talk to has difficulty understanding what is being said. The same is true of voice recognition software. Misunderstanding only a few percent of the words said may seem like a reasonable level of performance, unless you are the person not being understood. When Orange pulled the plug on Wildfire, it had to meet its obligations to disabled users by providing voice recognition software on the handsets themselves. This has one obvious advantage: the quality of the line is not a factor in whether the speaker is understood. The drawback of voice recognition software on the handset is that handsets lack the processing power to match the sophistication of software run on dedicated servers. So an approach based on thin clients, where a universal voice recognition service is accessed over a network, continues to be the most popular way to deliver this functionality. It is especially popular when provided as a common front end to a service like booking tickets or directory enquiries. In the US, the 1-800-FREE411 and 1-800-GOOG411 directory services are good examples of the latter. The reason for that, though, has more to do with eliminating the cost of paying call centre staff to answer calls than it has to do with providing an enhanced service to customers.

The breakthrough for speech recognition is perhaps just around the corner. Necessity is the mother of invention. Handset manufacturers have been riding the crest of a wave over the last few years, always able to come up with new additions to their devices in order to generate replacement sales. One of the interesting things about the iPhone, though, is that it shows the limits of these new ideas. Better screens, built-in cameras, music, touchscreens… but what comes next? Speech-driven interfaces are an obvious next step. The poor history of Apple’s own speech recognition software shows that the technical challenge is enormous, but handset manufacturers have reason to keep on investing in research, and not just because of the social obligation to provide communications to the disabled. If they do not, they will open up an opportunity for the networks to provide a valuable feature to their customers. Why store phone numbers on your device, if you could just call your network and then tell them, through a spoken command, to put you through to the person you name? Names pose an enormous challenge because, unlike commands, they are subject to cultural and language differences that cause many more variations in pronunciation. But after video, speech is the last great frontier for mobile communications. Whoever can get first-mover advantage in providing an effective voice interface to the most universal of demands – making calls, programming home devices like PVRs, and internet search – will reap the rewards.