Envisioning Future Voice Interaction

Exploring user experiences with wake-word-free voice commands

Type: Work, Baidu App Voice Version

Time: Sep - Nov 2017

Role: Lead UX Designer

Overview

The future of human-computer interaction extends beyond graphical user interfaces and touch inputs. Multimodal interaction, integrating graphics and voice, is an essential area of exploration. At Baidu App, I explored and experimented with multimodal interaction possibilities for mobile devices.

Background

Baidu App features a sub-product vertical, Audio News, which lets users listen to text-based articles in their News Feed using text-to-speech (TTS) technology. Learn more about my work on Audio News in my two other case studies.

User problem

Upon auditing the Baidu App, I found that a variety of audio content types, including news, books, music, and podcasts, were dispersed across different players. This setup fragmented the listening experience. Further analysis showed significant user overlap among these content types, which pointed to the need for a unified player to improve the listening experience.

User research revealed that many users listen to audio content while engaged in tasks that occupy their hands, known as busy-hand scenarios. In such scenarios, users require a complementary way to interact with the product. Introducing voice interaction to our listening experience is one potential solution.

Initial design proposal

I led a project exploring the integration of voice interaction into the Baidu App. The proposal attracted considerable interest from the Baidu App product line. In collaboration with the product team, we developed a product design showcase that captured executive interest and secured strong sponsorship for the project.

Highlights of the proposal:

Allow seamless listening to all audio content in the Baidu App via a unified audio player.
Enable intuitive and efficient voice-activated playback controls, searching, and easy content access.

User testing

This project introduced a new interaction mode to the Baidu App, so we gathered user feedback during product development. Our testing revealed two key insights:

Insight 1: Inefficiency of voice commands in playback controls

In a standard Voice User Interface (VUI), a Wake-Up-Word (WUW), such as “Hey Siri” or “Hey Google,” is necessary before issuing commands. For Baidu’s VUI, it's “Xiaodu Xiaodu.” Consider the simple action of skipping to the next track. With touch, it's a single quick tap, but with voice, users have to say “Xiaodu Xiaodu, next,” which adds four extra syllables before the command itself. This contrast between the swift tap and the lengthier voice command underscores the inefficiency of voice-based commands for basic playback controls.

Insight 2: Context-awareness in voice queries

Users often engage with content through context-aware questions. Besides basic playback controls, they sometimes search for information related to the current news, such as unfamiliar terms or individuals mentioned in the article. This behavior indicates a need for context-awareness in the app’s search functionality, allowing users to delve deeper into topics of interest seamlessly.

Design iteration

User testing results led us to question the necessity of the Wake-Up-Word (WUW). We asked: Can its use be reduced, or even eliminated in certain cases? What are the criteria for requiring it? Guided by user feedback, we assessed action and information requests based on frequency, ease of touch use*, relevance to the current context, and feasibility without the WUW. This assessment helped us categorize requests into two types: those that can be executed without the WUW and those that still require it.

*Ease of touch use refers to how simple and intuitive it is for users to perform certain actions using touch-based interactions in a graphical user interface (GUI).

Introducing the Baidu App Voice Version

After these optimizations, the product's voice interaction became more efficient and natural.

No Wake-Up-Word

For playback controls like pause, play, next, previous, and volume control, users can now say their commands directly without the WUW. They can also ask questions about the current news playing without the WUW, allowing for faster and easier access to supplemental information.

With Wake-Up-Word

For accessing information, content, and services not related to the ongoing audio, users should use the WUW followed by their request. This helps prevent misunderstandings and interruptions by distinguishing user intent. The WUW remains essential for broader inquiries to avoid false responses and ensure a smooth user experience.
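
The two categories above can be illustrated with a minimal routing sketch. This is not the shipped implementation. It only shows the decision rule as described in this case study, under several assumptions: the English command strings stand in for the actual spoken commands, the `NowPlaying` type and its `keywords` field are hypothetical, and plain substring matching is a placeholder for whatever context detection the product actually used.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum

WAKE_WORD = "xiaodu xiaodu"  # the WUW named in this case study

# Hypothetical command vocabulary, based on the playback controls listed above.
PLAYBACK_COMMANDS = {"pause", "play", "next", "previous", "volume up", "volume down"}


class Route(Enum):
    PLAYBACK = "playback"        # playback control, works without the WUW
    CONTEXT_QUERY = "context"    # question about the current news, works without the WUW
    GENERAL_REQUEST = "general"  # unrelated request, accepted only with the WUW
    IGNORED = "ignored"          # unrelated speech without the WUW


@dataclass
class NowPlaying:
    """Hypothetical description of the audio item currently playing."""
    title: str
    keywords: set[str]  # assumed: lowercase terms extracted from the article


def route_utterance(utterance: str, now_playing: NowPlaying) -> Route:
    """Decide how to handle an utterance, with or without the wake word."""
    text = utterance.lower().strip().rstrip("?.!")

    # Detect and strip the wake word, plus the comma or space that follows it.
    has_wuw = text.startswith(WAKE_WORD)
    if has_wuw:
        text = text[len(WAKE_WORD):].lstrip(" ,")

    # Playback controls are accepted either way.
    if text in PLAYBACK_COMMANDS:
        return Route.PLAYBACK

    # Questions that mention a term from the current article count as in-context.
    if any(keyword in text for keyword in now_playing.keywords):
        return Route.CONTEXT_QUERY

    # Everything else needs the WUW, so background speech is not acted on.
    return Route.GENERAL_REQUEST if has_wuw else Route.IGNORED


article = NowPlaying(title="Example article", keywords={"blockchain"})
print(route_utterance("Next", article))
print(route_utterance("What is blockchain?", article))
print(route_utterance("Xiaodu Xiaodu, what's the weather tomorrow?", article))
print(route_utterance("What's the weather tomorrow?", article))
```

In this sketch, the last utterance is ignored because it is unrelated to the current audio and lacks the WUW, which mirrors the goal of avoiding false responses.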

Impact

The project was presented by Baidu founder and CEO Robin Li at Baidu World 2017, the company's most important annual conference. At the beginning of his presentation, Robin said:

“In the future, no Wake-Up-Word is the way to natural voice interaction.”

The product was later launched in app stores in November 2017. Our project team was rewarded with two company awards:

2017 Baidu Outstanding Achievement Award
2017 Baidu Most Creative Award

More projects

Revamping Audio News Infrastructure

Humanizing Machine-Generated Speech

Visualizing Dutch Art Market