Voice-first computing moved from a long-held aspiration to everyday reality with the arrival of devices like Amazon Echo, ushering in platforms where spoken conversation is the primary interface and third-party extensions expand what assistants can do. Compared with rigid phone-tree IVR systems, voice-first assistants offer far more natural, open-ended interactions and can tap web services and developer-built skills or actions. The ecosystem is led by Amazon, Google, and Microsoft, is growing rapidly, and is increasingly multimodal, pairing speech with screens or other outputs while keeping voice as the main mode of interaction.
Designing effective voice user interfaces means treating conversation as the product: adopt an appropriate personality for the context, give the right amount of information at the right time, ask clarifying follow-ups, and help users recover when they go off script. Because users can’t scan menus or lists by ear, the burden shifts from users to designers and developers to make options discoverable, handle flexible input gracefully, and return highly relevant, succinct results. The goal is to anticipate intentions, guide the dialog, and keep interactions efficient and satisfying so users come back.
Under the hood, a voice interaction flows from wake word detection on the device to cloud-based processing that turns speech into text, applies natural language understanding to identify intents and extract variable details (slots), and invokes fulfillment code to do the work—often on serverless platforms like AWS Lambda or Google Cloud Functions. Developers train the NLU with sample utterances tied to intents and slots, call APIs or services in fulfillment, and then respond with text shaped by SSML to control prosody and pronunciation. The loop completes as text-to-speech delivers a clear, natural reply, closing the turn and setting up the next step in the conversation.
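To make that flow concrete, the following is a minimal fulfillment sketch, assuming a bare AWS Lambda entry point that receives an Alexa-style IntentRequest; the intent name TurnOnLightIntent, the room slot, and the trimmed-down request and response shapes are illustrative assumptions rather than the full platform schema.

// Hypothetical fulfillment for a light-control intent. The event and response
// interfaces below are simplified versions of the JSON envelope the assistant
// platform exchanges with fulfillment code.

interface SlotValue {
  name: string;
  value?: string;               // what the user actually said for this slot
}

interface IntentRequestEvent {
  request: {
    type: string;               // e.g. "IntentRequest" or "LaunchRequest"
    intent?: {
      name: string;             // the intent the NLU resolved
      slots?: Record<string, SlotValue>; // variable details pulled from the utterance
    };
  };
}

interface SkillResponse {
  version: string;
  response: {
    outputSpeech: { type: "SSML"; ssml: string };
    shouldEndSession: boolean;
  };
}

// Wrap plain text in the <speak> root tag so text-to-speech renders it as SSML.
function toSsml(text: string): string {
  return `<speak>${text}</speak>`;
}

// Lambda entry point: map the resolved intent and its slots to a spoken reply.
export async function handler(event: IntentRequestEvent): Promise<SkillResponse> {
  let speech = "Sorry, I didn't catch that.";

  if (event.request.type === "IntentRequest" &&
      event.request.intent?.name === "TurnOnLightIntent") {
    // "room" is a hypothetical slot; fall back when the user didn't name one.
    const room = event.request.intent?.slots?.room?.value ?? "living room";
    // A real skill would call a smart-home or other backend API here.
    speech = `Okay, turning on the ${room} lights.`;
  }

  return {
    version: "1.0",
    response: {
      outputSpeech: { type: "SSML", ssml: toSsml(speech) },
      shouldEndSession: true,
    },
  };
}

Note that the handler never touches audio or speech recognition: the platform handles listening and NLU, and fulfillment only ever sees structured intents and slots.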
Figure 1.1. Web flow compared to voice flow
Figure 1.2. Alexa relies on natural language understanding to answer the user’s question.
Figure 1.3. The overall user goal (the intent), the intent-specific variable information (the slot), and how it’s invoked (the utterance)
Summary
In voice-first applications, focus on taking the burden of completing an action off the user.
Data in a conversation flows back and forth between partners to complete an action.
Building a voice application involves reliably directing this data between systems.
Begin to think of requests in terms of intents, slots, and utterances.
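As a sketch of how those three pieces relate, the hypothetical TypeScript object below defines one intent with its sample utterances and a single slot; the names TurnOnLightIntent, room, and ROOM_LIST are invented for illustration, and the shape only loosely mirrors a real interaction-model definition.

// Hypothetical interaction-model fragment: one intent, the utterances that
// train the NLU to recognize it, and the slot it expects.

interface SlotDefinition {
  name: string;   // the variable detail to extract, e.g. which room
  type: string;   // the slot's type, e.g. a custom list of room names
}

interface IntentDefinition {
  name: string;             // the user's goal, analogous to a function name
  samples: string[];        // example utterances, with slots in curly braces
  slots: SlotDefinition[];  // the intent's "arguments"
}

export const turnOnLightIntent: IntentDefinition = {
  name: "TurnOnLightIntent",
  samples: [
    "turn on the {room} lights",
    "switch the {room} light on",
    "lights on in the {room}",
  ],
  slots: [{ name: "room", type: "ROOM_LIST" }],
};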
FAQ
What does “voice first” mean?
Voice-first platforms are interacted with primarily through voice and are open to third-party extensions (for example, skills on Alexa or actions on Google). Voice first is not voice only; many experiences are multimodal, combining voice with displays.

How are voice-first platforms different from traditional IVR phone trees?
IVR systems guide callers down rigid decision trees with a tiny set of choices. Voice-first platforms support natural language, can search the web, and can be extended with third-party skills, enabling far more open-ended, conversational interactions.

Which platforms matter today, and why isn’t Apple on the list?
The chapter focuses on Amazon (Alexa), Google (Assistant), and Microsoft (Cortana) because they opened their platforms to third-party developers. Apple’s Siri/HomePod, while popular, was not opened to third-party skills at the time discussed.

What is the basic flow of a voice command?
Wake word detection happens locally, then speech audio streams to the platform for speech-to-text. NLU maps the text to an intent and extracts slot values. The request is sent to fulfillment code (often serverless). The platform then converts the response text or SSML to speech and plays it back.

What are intents, sample utterances, and slots?
Intents represent what the user wants to do (like functions). Sample utterances are example phrases that train the NLU to recognize an intent. Slots are variable pieces of information within an utterance (like function arguments), such as a room name or time.

How do wake words work, and why are they recognized locally?
Devices continuously buffer audio and listen locally for a wake word or phrase (for example, “Alexa,” “Hey Google”). Local detection enables fast wake-up and helps with privacy by streaming audio only after activation.

What makes for good VUI (voice user interface) design?
Good VUIs mirror natural conversation: provide just enough information, ask clarifying questions, handle misunderstandings gracefully, and adopt an appropriate personality for the context while guiding users when they’re unsure.

How does designing for voice differ from designing for web or mobile UIs?
Voice shifts effort from users to designers and developers. You can’t show long menus or lists; options must be discoverable and intuitive, inputs handled naturally, and results sharply limited (often to one highly relevant answer).

What is fulfillment, and where does it typically run?
Fulfillment is the backend code that handles an intent: fetching data, applying logic, and crafting the response. It commonly runs on serverless platforms like AWS Lambda or Google Cloud Functions and returns output (often with SSML) to the assistant.

What is SSML and why use it?
Speech Synthesis Markup Language lets you control how the assistant speaks (pauses, rate, volume, emphasis, pronunciation, and more) so responses sound natural and clear. You wrap speech in speak tags and add prosody controls as needed; a short sketch follows this FAQ.
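Picking up the SSML answer above, here is a small sketch of shaping a reply before it reaches text-to-speech; the helper name, the greeting text, and the particular tags used (break, emphasis, prosody) are chosen for illustration, though the tags themselves are standard SSML.

// Hypothetical response builder: a pause, a bit of emphasis, and a slower
// speaking rate, all wrapped in the required <speak> root element.
function buildWelcomeSsml(userName: string): string {
  return [
    "<speak>",
    `Welcome back, <emphasis level="moderate">${userName}</emphasis>.`,
    '<break time="500ms"/>',
    '<prosody rate="slow">What would you like to do today?</prosody>',
    "</speak>",
  ].join(" ");
}

// The resulting string is what fulfillment would return as the response's
// SSML output speech.
console.log(buildWelcomeSsml("Sam"));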