Voice-first computing moved from a long-held aspiration to everyday reality with the arrival of devices like Amazon Echo, ushering in platforms where spoken conversation is the primary interface and third-party extensions expand what assistants can do. Compared with rigid phone-tree IVR systems, voice-first assistants offer far more natural, open-ended interactions and can tap web services and developer-built skills or actions. The ecosystem is led by Amazon, Google, and Microsoft, is growing rapidly, and is increasingly multimodal, pairing speech with screens or other outputs while keeping voice as the main mode of interaction.
Designing effective voice user interfaces means treating conversation as the product: adopt an appropriate personality for the context, give the right amount of information at the right time, ask clarifying follow-ups, and help users recover when they go off script. Because users can’t scan menus or lists by ear, the burden shifts from users to designers and developers to make options discoverable, handle flexible input gracefully, and return highly relevant, succinct results. The goal is to anticipate intentions, guide the dialog, and keep interactions efficient and satisfying so users come back.
Under the hood, a voice interaction flows from wake word detection on the device to cloud-based processing that turns speech into text, applies natural language understanding to identify intents and extract variable details (slots), and invokes fulfillment code to do the work—often on serverless platforms like AWS Lambda or Google Cloud Functions. Developers train the NLU with sample utterances tied to intents and slots, call APIs or services in fulfillment, and then respond with text shaped by SSML to control prosody and pronunciation. The loop completes as text-to-speech delivers a clear, natural reply, closing the turn and setting up the next step in the conversation.
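To make that flow concrete, the following is a minimal fulfillment sketch, assuming a bare AWS Lambda entry point that receives an Alexa-style IntentRequest; the intent name TurnOnLightIntent, the room slot, and the trimmed-down request and response shapes are illustrative assumptions rather than the full platform schema.

// Hypothetical fulfillment for a light-control intent. The event and response
// interfaces below are simplified versions of the JSON envelope the assistant
// platform exchanges with fulfillment code.

interface SlotValue {
  name: string;
  value?: string;               // what the user actually said for this slot
}

interface IntentRequestEvent {
  request: {
    type: string;               // e.g. "IntentRequest" or "LaunchRequest"
    intent?: {
      name: string;             // the intent the NLU resolved
      slots?: Record<string, SlotValue>; // variable details pulled from the utterance
    };
  };
}

interface SkillResponse {
  version: string;
  response: {
    outputSpeech: { type: "SSML"; ssml: string };
    shouldEndSession: boolean;
  };
}

// Wrap plain text in the <speak> root tag so text-to-speech renders it as SSML.
function toSsml(text: string): string {
  return `<speak>${text}</speak>`;
}

// Lambda entry point: map the resolved intent and its slots to a spoken reply.
export async function handler(event: IntentRequestEvent): Promise<SkillResponse> {
  let speech = "Sorry, I didn't catch that.";

  if (event.request.type === "IntentRequest" &&
      event.request.intent?.name === "TurnOnLightIntent") {
    // "room" is a hypothetical slot; fall back when the user didn't name one.
    const room = event.request.intent?.slots?.room?.value ?? "living room";
    // A real skill would call a smart-home or other backend API here.
    speech = `Okay, turning on the ${room} lights.`;
  }

  return {
    version: "1.0",
    response: {
      outputSpeech: { type: "SSML", ssml: toSsml(speech) },
      shouldEndSession: true,
    },
  };
}

Note that the handler never touches audio or speech recognition: the platform handles listening and NLU, and fulfillment only ever sees structured intents and slots.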
Figure 1.1. Web flow compared to voice flow
Figure 1.2. Alexa relies on natural language understanding to answer the user’s question.
Figure 1.3. The overall user goal (the intent), the intent-specific variable information (the slot), and how it’s invoked (the utterance)
Summary
In voice-first applications, focus on taking the burden of completing an action off the user.
Data in a conversation flows back and forth between partners to complete an action.
Building a voice application involves reliably directing this data between systems.
Begin to think of requests in terms of intents, slots, and utterances.
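As a sketch of how those three pieces relate, the hypothetical TypeScript object below defines one intent with its sample utterances and a single slot; the names TurnOnLightIntent, room, and ROOM_LIST are invented for illustration, and the shape only loosely mirrors a real interaction-model definition.

// Hypothetical interaction-model fragment: one intent, the utterances that
// train the NLU to recognize it, and the slot it expects.

interface SlotDefinition {
  name: string;   // the variable detail to extract, e.g. which room
  type: string;   // the slot's type, e.g. a custom list of room names
}

interface IntentDefinition {
  name: string;             // the user's goal, analogous to a function name
  samples: string[];        // example utterances, with slots in curly braces
  slots: SlotDefinition[];  // the intent's "arguments"
}

export const turnOnLightIntent: IntentDefinition = {
  name: "TurnOnLightIntent",
  samples: [
    "turn on the {room} lights",
    "switch the {room} light on",
    "lights on in the {room}",
  ],
  slots: [{ name: "room", type: "ROOM_LIST" }],
};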
FAQ
What does “voice first” mean?
Voice-first platforms are interacted with primarily through voice and are open to third-party extensions (for example, skills on Alexa or actions on Google). Voice first is not voice only; many experiences are multimodal, combining voice with displays.

How are voice-first platforms different from traditional IVR phone trees?
IVR systems guide callers down rigid decision trees with a tiny set of choices. Voice-first platforms support natural language, can search the web, and can be extended with third-party skills, enabling far more open-ended, conversational interactions.

Which platforms matter today, and why isn’t Apple on the list?
The chapter focuses on Amazon (Alexa), Google (Assistant), and Microsoft (Cortana) because they opened their platforms to third-party developers. Apple’s Siri/HomePod, while popular, was not opened to third-party skills at the time discussed.

What is the basic flow of a voice command?
Wake word detection happens locally, then speech audio streams to the platform for speech-to-text. NLU maps the text to an intent and extracts slot values. The request is sent to fulfillment code (often serverless). The platform then converts the response text or SSML to speech and plays it back.

What are intents, sample utterances, and slots?
Intents represent what the user wants to do (like functions). Sample utterances are example phrases that train the NLU to recognize an intent. Slots are variable pieces of information within an utterance (like function arguments), such as a room name or time.

How do wake words work, and why are they recognized locally?
Devices continuously buffer audio and listen locally for a wake word or phrase (for example, “Alexa,” “Hey Google”). Local detection enables fast wake-up and helps with privacy by streaming audio only after activation.

What makes for good VUI (voice user interface) design?
Good VUIs mirror natural conversation: provide just enough information, ask clarifying questions, handle misunderstandings gracefully, and adopt an appropriate personality for the context while guiding users when they’re unsure.

How does designing for voice differ from designing for web or mobile UIs?
Voice shifts effort from users to designers and developers. You can’t show long menus or lists; options must be discoverable and intuitive, inputs handled naturally, and results sharply limited (often to one highly relevant answer).

What is fulfillment, and where does it typically run?
Fulfillment is the backend code that handles an intent: fetching data, applying logic, and crafting the response. It commonly runs on serverless platforms like AWS Lambda or Google Cloud Functions and returns output (often with SSML) to the assistant.

What is SSML and why use it?
Speech Synthesis Markup Language lets you control how the assistant speaks (pauses, rate, volume, emphasis, pronunciation, and more) so responses sound natural and clear. You wrap speech in speak tags and add prosody controls as needed; a short sketch follows this FAQ.
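Picking up the SSML answer above, here is a small sketch of shaping a reply before it reaches text-to-speech; the helper name, the greeting text, and the particular tags used (break, emphasis, prosody) are chosen for illustration, though the tags themselves are standard SSML.

// Hypothetical response builder: a pause, a bit of emphasis, and a slower
// speaking rate, all wrapped in the required <speak> root element.
function buildWelcomeSsml(userName: string): string {
  return [
    "<speak>",
    `Welcome back, <emphasis level="moderate">${userName}</emphasis>.`,
    '<break time="500ms"/>',
    '<prosody rate="slow">What would you like to do today?</prosody>',
    "</speak>",
  ].join(" ");
}

// The resulting string is what fulfillment would return as the response's
// SSML output speech.
console.log(buildWelcomeSsml("Sam"));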