How to Design Awesome Voice Interfaces

Since its debut on the iPhone in 2011, Siri has become almost a synonym for any voice assistant.

Over time, nearly every smartphone came to include a voice assistant, and by 2018 the assistant had also become a fully functional piece of hardware in its own right.

Amazon Echo with Alexa, Google Home, and HomePod are the best known for their functionality and smart UX/UI design. According to various forecasts, the popularity of smart objects connected to the internet and controlled only by voice will continue to grow.

In the world of digital design, we have always been used to designing mainly for visual systems, creating GUIs and user experiences tied to visual perception. We know a great deal about the Gestalt laws and the eye movements of the user. However, in the case of voice-activated objects, there may be no graphic interface at all. What are the principles that help UX/UI design agencies build voice interaction in the best possible way? Let’s find out.

Talk and Communicate

Voice interfaces rely on speech, so their domain is linguistics, pragmatics, and semantics, while graphic interfaces revolve around vision and semiotics.

Every proficient VUI designer should know that building a VUI without understanding how humans converse is like building a GUI without any grounding in Gestalt principles and visual perception.

The maxims of Grice

One of the leading theorists of communication and meaning, Herbert Paul Grice, set out four basic maxims for conversation between individuals:

  1. Maxim of Quality. Say only what you believe to be true; do not lie.
  2. Maxim of Quantity. Say as much as is required, neither too much nor too little.
  3. Maxim of Relation. Be relevant to the topic being discussed.
  4. Maxim of Manner. Speak clearly and unambiguously.

All four maxims derive from the cooperative principle:

Make your conversational contribution such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged.

The structure of the conversation

The maxims are principles similar to the Gestalt ones. They are the cognitive foundation on which our ability to communicate and interpret a conversation is based.

The maxims can also be bent: a speaker can deliberately flout one of them to obtain a different communicative effect, much as optical illusions play on the Gestalt rules to mislead the brain. For example, if I say “You are a lion,” I violate the maxim of quality (since none of us are feline), but the phrase still has meaning: it is either a metaphor or irony.

Communication does not rely on the maxims alone. Speech also acquires meaning and content from other elements:

  • The context, or the environment in which the conversation is immersed, which is cultural, social, psychological, and physical;
  • The background noise, which can pollute the message physically (like the buzz of a crowd), psychologically (the mood of the recipient can lead the same message to be interpreted in very different ways), and culturally (a particular expression means one thing in one culture and something else in another);
  • The correlates of speech: non-verbal communication, such as gesticulation, and paraverbal communication, such as tone of voice and inflection.

Respectful and clear communication with the user is our #1 priority in VUI design, so following the maxims of Grice is a must.


As visually oriented designers, we can’t help comparing voice interfaces to their graphic sisters.

There are, in fact, some important contact points as well as notable differences.

Straight flow

GUIs are the realm of branching: screens can be arranged in tree-like structures of varying complexity, and the user is free to move from one screen to another.

The user can explore the interface like a map, follow visible indications, and view states and processes independently: the structure is hierarchical, and multiple pieces of content can be presented simultaneously.

In voice interfaces, visual aids are often in short supply, and we rely on conversation to express states and processes or to navigate the system.

For this reason, a user flow in a VUI is linear and step-by-step.

The user moves from one state to another only through a continuous sequence of triggers handled by the VUI, which is the only component able to access the entire system and its applications directly.
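The step-by-step flow described above can be sketched as a small state machine. This is only an illustration under stated assumptions, not the API of any real assistant: the `Step` and `LinearFlow` names, the slots, and the prompts are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One state in the flow: the slot to fill and the prompt the VUI speaks."""
    slot: str
    prompt: str


@dataclass
class LinearFlow:
    """Hypothetical linear dialog: the VUI, not the user, moves between states."""
    steps: List[Step]
    position: int = 0
    answers: Dict[str, str] = field(default_factory=dict)

    def current_prompt(self) -> Optional[str]:
        """Return what the VUI should say next, or None when the flow is done."""
        if self.position >= len(self.steps):
            return None
        return self.steps[self.position].prompt

    def hear(self, utterance: str) -> None:
        """Store the reply for the current step and advance exactly one state."""
        if self.position >= len(self.steps):
            raise RuntimeError("flow already finished")
        text = utterance.strip()
        if not text:
            return  # empty reply: stay on the same step so the prompt repeats
        self.answers[self.steps[self.position].slot] = text
        self.position += 1
```

In this sketch the user never jumps to an arbitrary state, as they could by tapping a tab in a GUI. Each call to `hear` either repeats the current prompt or moves one step forward, which is the linearity the paragraph above describes.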

Absence of screens

Both types of interface act as intermediaries between a user and a technological system, much like the steering wheel of a car, which lets you maneuver complex mechanisms using something simple.

  • GUIs present their triggers visually in the form of buttons, styles, tabs, or text, and the user operates them by interacting directly with the graphic material on the screen by mouse or touch.
  • VUIs, on the other hand, have the voice intermediary as their only manipulable component: an application that listens to speech and pairs each input with a specific output. That output is presented to the user, again by voice, and the user can act on it by voice in turn.

Expression and intention

The components of the voice interface are intangible and invisible.

  • What would be labels, buttons, icons, and forms in a GUI become sentences that the user must process on the spot, relying on short-term memory.
  • The inputs also vary: to start a song on Spotify, one must press the “play” button, whereas a person could say to Siri “Sing me …”, “Play …”, or even “Start this song”. These are different ways of expressing the same intention.
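The idea that several expressions map to one intention can be sketched in a few lines. This is a toy example under stated assumptions: the intent names and phrase patterns are hypothetical, and a production assistant would typically use a trained language-understanding model rather than hand-written regular expressions.

```python
import re
from typing import Optional

# Hypothetical phrase patterns, grouped by the intent they express.
INTENT_PATTERNS = {
    "play_music": [
        r"^play\b",
        r"^sing me\b",
        r"^start (this|that|the) song\b",
    ],
    "stop_music": [
        r"^stop\b",
        r"^pause\b",
    ],
}


def resolve_intent(utterance: str) -> Optional[str]:
    """Return the first intent whose pattern matches the utterance, else None."""
    text = utterance.strip().lower()
    for intent, patterns in INTENT_PATTERNS.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return intent
    return None
```

Here `resolve_intent("Sing me a lullaby")`, `resolve_intent("Play Yesterday")`, and `resolve_intent("Start this song")` all return `"play_music"`, while an unrecognized request returns `None` so the interface can ask the user to rephrase.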

Nielsen heuristics

For both GUIs and VUIs, everything Jakob Nielsen defined in his ten heuristics for good system usability remains valid, and it matters even more in the absence of visual support.


VUIs are becoming more and more popular thanks to improvements in the technologies behind them: computing power, AI, and machine learning capable of understanding conversations and synthesizing speech. In all probability, the need for human-friendly voice interfaces will continue to grow and spread across different industries.


Skyje is a Blog for Web Designers and Web Developers featuring Social Networking news and everything that Web 2.0. You can Subscribe to Skyje feed.
