Why Your Voice Assistant Doesn’t Work

Most people don’t really use their phone’s voice assistant. Even though it does an amazing job behind the scenes, the overall user experience doesn’t feel right.

If you think about how human interaction works, you may realize that we are always given something more than just a voice. We humans understand the overall context of the conversation. Furthermore, our brain constantly perceives and evaluates an immense number of other things, even though we are not consciously aware of it.

The driving scenario

Let’s imagine this situation: You and your girlfriend are walking back from the beach and you are about to drive home. You sit in the car, the windows are open, and you say “Drive me home!”. If there is a person called “Remi” in your contacts (this is actually our friend's name), the voice assistant may completely misunderstand you and respond: “Remi home… It looks like you don’t have a home address listed for Remi”. So why does this happen?

As I have already mentioned, the voice assistant is missing something: the overall context. And your girlfriend? She simply understands. What she doesn’t realize is that she didn’t just hear your voice. She is also aware of an immense number of other tiny details: you’ve been on a day-long trip, you’re sitting in the car, you’ve decided to go home, and so on. So to her it simply seems more probable that the sound you uttered really means “Drive me home!”. The sentence matches the situation.
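
To make the idea of "more probable given the situation" a bit more concrete, here is a minimal sketch. It assumes the assistant has two candidate transcripts with scores for how well each one matches the audio, and that the situation can be turned into a prior for each transcript. All names and numbers are made up for illustration; this is not how any particular voice assistant works.

```python
# Hypothetical numbers throughout: a toy sketch of context-aware rescoring.

# How well each candidate transcript matches the audio alone (made-up scores).
acoustic_score = {
    "Drive me home": 0.45,
    "Remi home": 0.55,
}

# How plausible each transcript is in a given situation (made-up priors).
context_prior = {
    "no_context": {"Drive me home": 0.50, "Remi home": 0.50},
    "in_car_after_trip": {"Drive me home": 0.90, "Remi home": 0.10},
}


def best_transcript(context: str) -> str:
    """Pick the transcript with the highest acoustic score times context prior."""
    priors = context_prior[context]
    return max(acoustic_score, key=lambda t: acoustic_score[t] * priors[t])


print(best_transcript("no_context"))         # audio alone slightly favors "Remi home"
print(best_transcript("in_car_after_trip"))  # the situation tips it to "Drive me home"
```

With the audio alone, the slightly better-sounding "Remi home" wins. Once the situation is taken into account, "Drive me home" becomes the more probable reading, which is what your girlfriend does without thinking about it.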

Building the context

What if there was no voice assistant at all? What if you called your… mom instead? Just like that. She wouldn’t know you were on a trip, and she wouldn’t know you were sitting in the car. And now, right after she picks up the ringing phone, you utter “Drive me home!”. What would she really hear? Or even better: what would she understand? She would be very surprised, to say the least. And so is your voice assistant.

If you provide her with further information, she starts to understand. And if you don’t, she would probably start asking questions herself. She would try to build up the overall context (“What are you talking about, son? Are you OK?”). It’s part of human nature. Your voice assistant doesn't ask such questions yet. Or it does, but only in a very limited scope. Now you can easily imagine how powerful voice assistants will become once they really start to ask questions.

Speech and voice

Even in the previous scenario, your mom is still much more immersed in the overall situation than the voice assistant. How come? Well, she could have seen your phone number before picking up the phone. She also knows your voice. In fact, she’s been listening to you for half of her life. She knows your accent and the way you speak. She already knows all your personal idiosyncrasies. But your voice assistant has millions of other “children” who can’t even speak English. And it has to understand each and every one of them. So how bad is your voice assistant now?

To tackle these complexities, AI research distinguishes two different fields:

Speech recognition¹ focuses on recognizing the words being said, regardless of who said them. The words are what matters here.

Speaker/voice recognition², on the other hand, doesn’t care about the meaning of the sounds. Instead, it tries to identify the person who said them based on the characteristics of the speaker’s voice.

The big picture

AI needs context in every area, not only in speech and speaker/voice recognition. In fact, this is the next frontier of current research³. It all makes sense, since we humans want AI to be more human.

I usually like to demonstrate this with a simple example of pattern/image recognition. Let’s look at the following symbol. What can you see?

0

I see a circle. Or a ring, or a digit zero, or an “O” from the Latin alphabet.

Let’s zoom out a little:

10028

Now it’s more obvious. It’s a number for sure.

And if we zoom out even more:

The Metropolitan Museum of Art, 1000 5th Ave, New York, NY 10028, United States

It’s not part of a regular number anymore.

Now it’s part of a ZIP code.
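
The same zooming-out can be sketched in a few lines of code. This is only a toy heuristic, assuming that a ZIP code shows up as a two-letter state abbreviation followed by five digits; the function name and the rules are mine, not a real recognition algorithm.

```python
import re

# Assumed pattern: a two-letter US state abbreviation followed by five digits.
ZIP_IN_ADDRESS = re.compile(r"\b[A-Z]{2} (\d{5})\b")


def interpret(symbol: str, context: str) -> str:
    """Guess what a symbol means from the text around it (toy heuristic)."""
    # Widest view: the symbol sits inside something that looks like a ZIP code.
    if any(symbol in zip_code for zip_code in ZIP_IN_ADDRESS.findall(context)):
        return "part of a ZIP code"
    # Medium view: the symbol is surrounded by other digits only.
    if len(context) > 1 and context.isdigit():
        return "digit in a number"
    # Narrowest view: nothing around the symbol to go on.
    return "ambiguous: circle, ring, zero, or letter O"


address = "The Metropolitan Museum of Art, 1000 5th Ave, New York, NY 10028, United States"
for view in ("0", "10028", address):
    print(interpret("0", view))
```

The symbol never changes. Only the amount of surrounding text does, and with it the answer.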

Conclusion

We don’t even realize it, but we humans constantly perceive and evaluate an immense number of tiny details around us. And we do it without much effort. We just evolved this way. Even though we constantly hear about reaching the next milestone in AI, understanding the context is a tough one. If you plan to incorporate AI to help your business, you should first really understand its limits. Don’t overestimate how much it can help, and don’t underestimate its weaknesses.

At Educards, we did this by trimming the voice commands. You can't talk to Educards like you would to a chatbot, but your voice is at least recognized. This way we are able to provide pleasant learning sessions. You can focus on your studies rather than chatbot misunderstandings.
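
To illustrate what trimming the voice commands can look like, here is a minimal sketch that maps whatever the speech recognizer returns onto a small, fixed set of commands. The command names, the similarity cutoff, and the use of Python's difflib are assumptions made for this example; this is not the actual Educards implementation.

```python
import difflib
from typing import Optional

# Hypothetical command set, chosen only for illustration.
COMMANDS = ["next", "previous", "repeat", "show answer"]


def match_command(transcript: str, cutoff: float = 0.6) -> Optional[str]:
    """Map a noisy transcript to the closest known command, or None if nothing is close."""
    cleaned = transcript.lower().strip()
    matches = difflib.get_close_matches(cleaned, COMMANDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_command("Next"))
print(match_command("show answers"))
print(match_command("drive me home"))
```

With only a handful of allowed commands, a slightly misheard phrase still lands on the right one, and anything that doesn't resemble a command is simply rejected instead of being misinterpreted.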

[1] wikipedia.org, Speech Recognition (https://en.wikipedia.org/wiki/Speech_recognition)
[2] wikipedia.org, Speaker Recognition (https://en.wikipedia.org/wiki/Speaker_recognition)
[3] blog.adobe.com, Contextual AI: The Next Frontier of Artificial Intelligence (https://blog.adobe.com/en/2019/04/09/contextual-ai-the-next-frontier-of-artificial-intelligence.html)