Introduction to Virtual Assistants

  • Tanay Pant


This chapter gives a detailed overview of what virtual assistants are, common virtual assistants in the market, what qualities a virtual assistant should possess, and the basic workflow and design for building a scalable virtual assistant. You also learn about the various tools required to build Melissa (your own virtual assistant) in upcoming chapters and the methodology you follow in this book.

The advent of virtual assistants has been an important event in the history of computing. Virtual assistants help users automate and accomplish tasks with minimal direct interaction with the machine. The interaction between a user and a virtual assistant feels natural: the user communicates with their voice, and the software responds in kind.

If you have seen the movie Iron Man, you can perhaps imagine having a virtual assistant like Tony Stark's Jarvis. Does that idea excite you? The movie inspired me to build my own virtual assistant software, Melissa. Such a virtual assistant can also serve in the Internet of Things, powering anything from a voice-controlled coffee machine to a voice-controlled drone.

Commercial Virtual Assistants

Virtual assistants are useful for carrying out tasks such as saving notes, telling you the weather, playing music, retrieving information, and much more. Following are some virtual assistants that are already available in the market:
  • Google Now: Developed by Google for Android and iOS mobile operating systems. It also runs on computer systems with the Google Chrome web browser. The best thing about this software is its voice-recognition ability.

  • Cortana: Developed by Microsoft and runs on Windows for desktop and mobile, as well as in products by Microsoft such as Band and Xbox One. It also runs on both Android and iOS. Cortana doesn’t entirely rely on voice commands: you can send commands by typing.

  • Siri: Developed by Apple and runs only on iOS, watchOS, and tvOS. Siri is a very advanced personal assistant with lots of features and capabilities.

These are sophisticated applications, but they are proprietary, so you can't run them on a Raspberry Pi.

Raspberry Pi

The software you are going to create should be able to run with limited resources. Even though you will develop Melissa on a laptop or desktop system, you will eventually run it on a Raspberry Pi.

The Raspberry Pi is a credit-card-sized, single-board computer developed by the Raspberry Pi Foundation for the purpose of promoting computer literacy among students. The Raspberry Pi has been used by enthusiasts to develop interesting projects of varying genres. In this book, you will build a voice-controlled virtual assistant named Melissa to control this little computer with your voice.

This project uses a Raspberry Pi 2 Model B. You can find information on where to purchase one on the Raspberry Pi Foundation's website. Do not worry if you don't currently have a Raspberry Pi; you will carry out the complete development of Melissa on a *nix-based system.

How a Virtual Assistant Works

Let’s discuss how Melissa works. Theoretically, such software primarily consists of three components: the speech-to-text (STT) engine, the logic-handling engine, and the text-to-speech (TTS) engine (see Figure 1-1).
Figure 1-1. Virtual assistant workflow
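
Conceptually, the interaction shown in Figure 1-1 is a simple pipeline from STT to logic to TTS. The following minimal Python sketch shows only the flow; the three placeholder functions stand in for the real components you build in later chapters and are not Melissa's actual code.

def speech_to_text(audio):
    # Placeholder STT engine: a real one converts recorded audio to text.
    return "what time is it"

def handle_command(text):
    # Placeholder logic engine: map the recognized text to a response.
    if "time" in text:
        from datetime import datetime
        return "It is " + datetime.now().strftime("%H:%M")
    return "Sorry, I did not understand that."

def text_to_speech(message):
    # Placeholder TTS engine: a real one speaks the string aloud.
    print(message)

# The complete interaction is a three-stage pipeline.
text_to_speech(handle_command(speech_to_text(audio=None)))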

Speech-to-Text Engine

As the name suggests, the STT engine converts the user’s speech into a text string that can be processed by the logic engine. This involves recording the user’s voice, capturing the words from the recording (cancelling any noise and fixing distortion in the process), and then using natural language processing (NLP) to convert the recording to a text string.
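
To get a feel for what an STT engine does, you can experiment with the third-party SpeechRecognition package (it requires PyAudio and a working microphone). This is only an illustrative sketch, not necessarily the exact approach Melissa uses in Chapter 2.

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    # Sample the background noise so it can be filtered out.
    recognizer.adjust_for_ambient_noise(source)
    print("Say something...")
    audio = recognizer.listen(source)

try:
    # Send the recording to a recognition service and get back a string.
    print("You said: " + recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Sorry, I could not understand that.")
except sr.RequestError:
    print("The recognition service could not be reached.")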

Logic Engine

Melissa’s logic engine is the software component that receives the text string from the STT engine and handles the input by processing it and passing the output to the TTS engine. The logic engine can be considered Melissa’s brain; it handles user queries via a series of if-then-else clauses in the Python programming language. It decides what the output should be in response to specific inputs. You build Melissa’s logic engine throughout the book, improving it and adding new functionalities and features as you go.
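
As a toy illustration of this if-then-else style of handling (the keywords and replies here are invented for the example, not taken from Melissa's actual code):

def handle_input(text):
    # Route the recognized text to an appropriate response string.
    text = text.lower()
    if "hello" in text:
        return "Hello! How can I help you?"
    elif "your name" in text:
        return "My name is Melissa."
    elif "time" in text:
        from datetime import datetime
        return "The time is " + datetime.now().strftime("%H:%M")
    else:
        return "I'm sorry, I don't know how to help with that yet."

print(handle_input("What is your name?"))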

Text-to-Speech Engine

This component receives the output from Melissa's logic engine and converts the string to speech to complete the interaction with the user. TTS is crucial for making Melissa feel more human, compared to confirming actions via text alone.
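
For quick prototyping, you can lean on a command-line TTS tool that your operating system already provides. The sketch below assumes the built-in say command on OS X and the espeak utility on Linux (if installed); it is a stand-in, not Melissa's final TTS engine.

import os
import sys

def speak(message):
    # Hand the message to the platform's command-line TTS tool.
    if sys.platform == "darwin":
        os.system('say "%s"' % message)     # OS X ships with 'say'
    else:
        os.system('espeak "%s"' % message)  # assumes espeak is installed

speak("Hello, I am Melissa.")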

This three-component system removes any physical interaction between the user and the machine; users can interact with their system much as they would with another human being. You learn more about the STT and TTS engines and how to implement them in Chapter 2.

From a high-level view, these are the three basic components that make up Melissa. This book shows you how to do all the necessary programming to develop them and put them together.

Setting Up Your Development Environment

This section lays the foundation for the book's later chapters. You need a computer running a *nix-based operating system such as Linux or OS X. I am using a MacBook Air (early 2015) running OS X 10.11.1 for the purpose of illustration.

Python 2.x

You will write Melissa’s code in the Python programming language. So, you need to have the Python interpreter installed to run the Python code files. *nix systems generally have Python preinstalled. You can check whether you have Python installed by running the following command in the terminal of your operating system:

$ python --version

This command returns the version of Python installed on your system. In my case, it gives the following output:

Python 2.7.11

This should also work on other versions of Python 2.


I am using Python 2 instead of Python 3 because the various dependencies used throughout the book are written in Python 2.

Python Package Index (PyPI)

You need pip to install the third-party modules required for various parts of the software. Using these modules means you don't have to reinvent the wheel for basic operations.

You can check whether pip is installed on your system by issuing the following command:

$ pip --version

In my case, it gives this output:

pip 7.1.2 from /usr/local/lib/python2.7/site-packages (python 2.7)

If you do not have pip installed, you can install it by following the official pip installation guide.

Version Control System (Git)

You use Git for version control of your software as you work on it, to avoid losing work due to hardware failure or system administrator mistakes. You can use GitHub to upload your Git repository to an online server. You can check whether you have Git installed on your system by issuing the following command:

$ git --version

This command gives me the following output:

git version 2.6.2

If you do not have Git installed, you can install it using the instructions on the official Git website.


PortAudio

PortAudio is an open source audio input/output library. It is cross platform and is available as source files that can be downloaded from the PortAudio website and compiled on many platforms, such as Windows, OS X, and Unix. PortAudio provides a simple API for recording and playing sound, which is used by some of the speech-recognition modules in future chapters.


PyAudio

PyAudio provides Python bindings for PortAudio. With its help, you can easily use Python to record and play audio on a variety of platforms, which is exactly what you need for your STT engine. You can find the instructions for installing PyAudio in the PyAudio documentation.
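
Once PyAudio is installed, you can confirm that it can reach your microphone with a short recording test like the one below. The file name and audio parameters are arbitrary choices for this check, not values required by the book.

import wave
import pyaudio

# Record roughly three seconds from the default microphone
# and save it as a WAV file.
CHUNK = 1024
RATE = 16000
SECONDS = 3

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)
frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
stream.stop_stream()
stream.close()
sample_width = p.get_sample_size(pyaudio.paInt16)
p.terminate()

# Write the captured frames to disk so you can play them back.
wf = wave.open("test_recording.wav", "wb")
wf.setnchannels(1)
wf.setsampwidth(sample_width)
wf.setframerate(RATE)
wf.writeframes(b"".join(frames))
wf.close()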

You also need a microphone via which you can speak to your computer (and perform voice recording) and speakers to hear the output. Most modern laptops have these installed by default. For a Raspberry Pi, you need an external microphone and speakers/earphones.

Designing Melissa

You will follow the DRY (don’t repeat yourself) and KISS (keep it simple, stupid) principles and use modular code to design Melissa. Doing so helps maintain your code properly and makes it easier to scale the code in the future when you want to add cool features to your existing codebase. So, let’s first design the structure of your code directories:
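
A layout along the following lines is one way to organize the project; the specific file names shown here (main.py, brain.py, tts.py) are placeholders for illustration rather than fixed requirements:

melissa/
    main.py
    brain.py
    profile.yaml.default
    .gitignore
    requirements.txt
    GreyMatter/
        __init__.py
        ...
    SenseCells/
        __init__.py
        tts.py
        ...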


In this directory structure, ... denotes that files will be added there as you go through the chapters in this book. The folders containing an __init__.py file are Python packages. The main entry-point file will contain the source code for the completed STT engine; it will pass commands (in the form of strings) to the logic module for handling (this is the logic engine mentioned previously). The SenseCells package will contain the TTS engine, and the GreyMatter package will contain the various mini-features you integrate into the software as you progress through the book. The requirements.txt file keeps track of the third-party Python modules used in this project.
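
With the third-party modules listed in requirements.txt, you (or anyone cloning the repository) can install them all in one step:

$ pip install -r requirements.txt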

The profile.yaml.default file will store information such as the name of the user and the city where the user lives, in YAML format. A profile.yaml file is required to run the program, so the user issues the following command to get the software up and running:

$ cp profile.yaml.default profile.yaml

You use the .default suffix so that users' personal information stays out of version control: profile.yaml is listed in the .gitignore file, so if a user creates a pull request on GitHub, their private changes to profile.yaml are never included.

Currently, profile.yaml.default contains default values for these fields; for example, the default city is New Delhi.
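
Put together, a minimal profile.yaml.default for these values might look like the following; the key names are placeholders for illustration, not necessarily the ones used in the finished project:

name: Your Name
city: New Delhi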

The .gitignore file lists profile.yaml so that your personal settings are never committed.
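
For reference, a .gitignore consistent with this setup could look like the following; entries beyond profile.yaml are common Python housekeeping additions, not requirements of this chapter:

profile.yaml
*.pyc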


Now that you know the high-level directory structure of the project, you can go ahead and create the skeleton structure. This structure will help you keep the code base clean and properly organized as you move through the book and work on building new features.

Learning Methodology

This section describes the methodology you use throughout the book: understanding concepts, learning by prototyping, and then developing production-quality code to integrate into the skeleton structure you just developed (see Figure 1-2).
Figure 1-2. Learning methodology

First, you explore the theoretical concepts and core principles; these enhance your creativity and help you see different ways to implement features. This part may seem boring to some people, but do not skip it.

Next, you implement your newly acquired knowledge in Python code and play around with it to turn that knowledge into skill. Prototyping helps you understand how individual components work without the danger of messing up the main codebase. Finally, you edit and refactor the code into good-quality code that can be integrated with the main codebase to enhance Melissa's capabilities.


Summary

In this chapter, you learned what virtual assistants are. You also saw various virtual assistants that exist in the commercial market, the features a virtual assistant should possess, and the workflow of a voice-controlled virtual assistant. You designed Melissa's codebase structure and were introduced to the methodology this book follows to create an effective learning workflow.

In the next chapter, you study the STT and TTS engines. You implement them in Python to create Melissa’s senses. This lays the foundation of how Melissa will interact with you; you use the functionalities implemented in the next chapter throughout the book.


